MTDIS

Welcome to the mitochondrial DNA distance analysis program (MTDIS). This program was
initially written to calculate genetic distances between populations for restriction site data
obtained from single-copy DNA sequences (e.g. mtDNA). The present version of MTDIS will
also calculate inter-population genetic distances for DNA sequence data obtained from
single-copy DNA.
The algorithm used compares frequency differences between populations for the
presence of a restriction site or polymorphic nucleotide bases across all haplotypes present in
both populations. The formulation of the distance measure is taken from the one given on page
651, of: Danzmann et al. 1991. J. Fish Biol. 39:649-659.
Note: The minimum number of populations being compared should be 3.
The present version of the program will allow genetic distances to be calculated between
single-copy DNA (e.g. mtDNA) haplotypes for:
1) restriction site data
2) DNA sequences
Note: only a p-distance estimation of sequence divergence is made between
haplotypes where:
p = nd / n
nd = the number of base pair (bp) differences between haplotypes, and
n = the total number of bp sequenced.
Thus, MTDIS is most appropriate for analyzing recently diverged conspecific populations where
pairwise divergences are small. The program should not be used for interspecific phylogenetic
analyses.
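The p-distance defined above can be illustrated with a minimal Python sketch. The function name p_distance is hypothetical and is not part of MTDIS; the sketch assumes two aligned haplotype sequences of equal length and counts every mismatching column as one difference.

```python
def p_distance(seq_a, seq_b):
    """Return p = nd / n for two aligned haplotype sequences.

    nd is the number of positions at which the sequences differ and
    n is the total number of positions compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    n = len(seq_a)
    # Compare case-insensitively, column by column.
    nd = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return nd / n


if __name__ == "__main__":
    # Two six-base haplotypes that differ at four positions.
    print(p_distance("ACCGAG", "GTTGAA"))
```

Because no correction for multiple substitutions is applied, a measure of this kind is only reasonable when divergences are small, as noted above.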
Two types of data sets may be analyzed with MTDIS.
1. Single data sets that would generally consist of the empirical data.
2. Bootstrapped data sets that can be used to test confidence limits on the empirical data set.
Bootstrapped data sets can be obtained by modifying your input data set according to the file
format for SEQBOOT in PHYLIP (see below)(Felsenstein, 1993). The randomized data sets
may then be generated by running SEQBOOT to the specified number of randomizations desired.
Once SEQBOOT has generated its output file, the first few lines of the file will need to be
modified before analyzing the data for restriction site or DNA sequence differences.
SEQBOOT is available in version 3.5 of PHYLIP.
________________________________________________________________________
HAPLOTYPE DESCRIPTION INPUT FILES:
1. RESTRICTION SITE and DNA SEQUENCE DATA:
Two types of input files are required for MTDIS. One file consists of the population data set
and contains the population name designations (maximum of 8 letters), the names of the
haplotypic designations (maximum of 8 letters), the number of haplotypes in each population,
and the number of individuals possessing each haplotype (see below in section 2).
The second input file consists of either the site matrix data for restriction site data or DNA
sequence data as follows:
1a) Restriction site data:
These input data files consist of site gains ONLY between populations.
Note: All files of this type must have a *.res extension.
The data matrix which must be supplied is a 0,1 matrix of recognition site presence=1/absence=0
for each haplotype being compared. If a site is unknown, it may be coded as any number greater
than 1 (I suggest using the numbers 2 - 9 to keep your column structure linear) and MTDIS will
then simply skip the region in its frequency calculation.
Data MUST be entered using haplotypes as the rows and recognition sites as the columns.
For example, if you have three haplotypes in your species with six polymorphic recognition sites,
where:
site 1 is shared between haplotype 1 + 2
site 2 is shared between haplotype 1 + 3
site 3 is shared between haplotype 2 + 3
site 4 is only present in haplotype 3
site 5 is only present in haplotype 3
site 6 is only present in haplotype 1
You should have the following data matrix structure.
3 6
1 1 0 0 0 1
1 0 1 0 0 0
0 1 1 1 1 0
Before you enter the presence/absence matrix for the data structure, you must specify the total
number of haplotypes observed as the first value followed by the total number of polymorphic
restriction sites detected as the second number.
In the present example 6 sites were detected and 3 haplotypes were described.
Remember to leave at least one space between each data value! Also you must start data entry
on line 1 and column 1. Do not leave empty lines in your data entry as the program may read
these as zero values.
Following construction of the data matrix you will need to enter a data line indicating whether
the site is recognized by a 6-base, 5-base, or 4-base cutter. For example if all 6 restriction sites
in the previous example are recognized by a 6-base cutter, then enter:
3 6
1 1 0 0 0 1
1 0 1 0 0 0
0 1 1 1 1 0
6 6 6 6 6 6
Finally, you will need to enter the monomorphic sites (shared among all haplotypes recorded) for
all 4-base, 5-base, 6-base, and 8-base cutting enzymes used. Enter these values on a separate line
for each enzyme.
The monomorphic sites detected with 4-base cutters would be entered first, followed by 5-base,
6-base, and 8-base sites respectively.
In the present example only 6-base cutting enzymes were used, and therefore 0 values would be
entered for the other categories. Let us suppose that with the six polymorphic 6-base sites
detected there were 15 monomorphic sites.
The data input file would then look like this:
3 6
1 1 0 0 0 1
1 0 1 0 0 0
0 1 1 1 1 0
6 6 6 6 6 6
0
0
15
0
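The layout described so far, together with the closing "DONE" line that every input file must end with, can be sketched in Python as follows. The function format_res and its arguments are hypothetical names chosen for illustration and are not part of MTDIS. The sketch assumes that values on a line are separated by single spaces, following the instruction to leave at least one space between data values.

```python
def format_res(matrix, cutter_sizes, monomorphic):
    """Build the text of a *.res input file.

    matrix       -- list of rows (one per haplotype) of 0/1 site scores
    cutter_sizes -- recognition length (4, 5, 6 or 8) for each site column
    monomorphic  -- dict mapping cutter size to its monomorphic site count
    """
    n_haps = len(matrix)
    n_sites = len(matrix[0])
    if any(len(row) != n_sites for row in matrix):
        raise ValueError("every haplotype row needs the same number of sites")
    if len(cutter_sizes) != n_sites:
        raise ValueError("one cutter size is required per site column")

    # Line 1: number of haplotypes, then number of polymorphic sites.
    lines = [f"{n_haps} {n_sites}"]
    # Haplotypes as rows, recognition sites as columns.
    lines += [" ".join(str(v) for v in row) for row in matrix]
    # Recognition length for each site column.
    lines.append(" ".join(str(c) for c in cutter_sizes))
    # Monomorphic site counts in the order 4-, 5-, 6-, 8-base cutters.
    lines += [str(monomorphic.get(size, 0)) for size in (4, 5, 6, 8)]
    lines.append('"DONE"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = format_res(
        [[1, 1, 0, 0, 0, 1], [1, 0, 1, 0, 0, 0], [0, 1, 1, 1, 1, 0]],
        [6, 6, 6, 6, 6, 6],
        {6: 15},
    )
    print(example, end="")
```

Generating the file from a matrix held in memory avoids stray blank lines, which the program may read as zero values.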
Multiple intervening random nucleotide sites such as ATTNNNNNNNAAT would be considered a
6-base recognition site since only six unique sites are defined. It was considered unlikely that a
more complex recognition enzyme than an 8-base cutter would be used in a routine population
genetics survey and therefore input for restriction fragment patterns from such enzymes is not
accommodated.
Note: You will have to apply an averaging factor for multiple recognition sequence enzymes that
share sites with other enzymes.
For example, if two 6-base cutters recognize 4 different sites, and one of these sequence sites is
recognized by both enzymes, then subtract 1/8 of the total number of sites recognized by both
enzymes.
Note: It is a good idea to insert a number of lines at the end of your data file describing which
enzyme and fragment sizes correspond to each data column.
These values will not be read by the program and are merely suggestions for your archival
convenience.
Start each of these reference lines with a REM statement or with an ' (apostrophe)
designation. VBASIC recognizes such lines as REMARK lines and will not process their
contents.
Files may be set up with any standard ASCII text editor such as NORTON commander or
CED.COM or most word processing packages that will save text in ASCII format.
The final line of the file MUST contain the word "DONE" to signal that the file input is ended.
For example:
3 6
1 1 0 0 0 1
1 0 1 0 0 0
0 1 1 1 1 0
6 6 6 6 6 6
0
0
15
0
"DONE"
1b) DNA sequence data:
The second type of data set which may be entered is DNA sequence data for single copy DNA.
Note: All data sets of this type must have a *.seq extension.
I have also included a program called SEQFLTR with this package that will allow you to prepare
raw DNA sequence data from other applications for input into MTDIS. The program essentially
filters out all monomorphic base pairs among the haplotypes you are comparing and writes a
sequential input file for MTDIS containing only the polymorphic sites (see below).
In a similar fashion to *.res files, the first two values that are entered in the MTDIS
*.seq input file must be the number of haplotypes followed by the number of polymorphic DNA
sequence sites.
e.g.
3 6
ACCGAG
GTTGAA
ATCAGG
94
"DONE"
In this example, you will note that DNA sequence data must be entered in sequential format
WITHOUT any spaces between the nucleotide designations.
Nucleotides may be entered in either upper or lower case, but a U designation is not allowed.
Missing sites (i.e. INDEL regions) must be coded as a blank (i.e. "-"). Unknown data sites may
be coded as either an "N", "X", or "?".
Following the entry of polymorphic sites, the number of monomorphic sites is entered. Thus,
in the present example, 100 bp among all 3 haplotypes are examined. Six of these sites are
polymorphic and 94 are monomorphic.
If your data set is relatively small then you could list the entire DNA sequence for each haplotype
(including monomorphic sites) in a sequential fashion. You would then still need to indicate the
number of monomorphic sites "outside" of the sequences input to MTDIS. In this case a zero
value would be listed for this input variable.
Note: Array sizes are limited to 32,000 elements, and thus any combination of haplotypes x
number of base pairs screened which exceed this limit will cause the program to 'crash'. If this
is the case you will need to run SEQFLTR to reduce your data set to informative sites only. It is
generally a good idea to do this anyway. Why bother manipulating all those non-informative
sites?
When data entry is complete, you must write "DONE" in quotation marks at the end of the file,
similar to the format for *.res files.
1b.i) Distance calculations:
The algorithm used for the estimation of pairwise genetic distances among the populations being
compared for DNA sequence data is based upon the summation of frequency differences at
nucleotide sites between populations as follows:
\[
\left( \sum_{i=1}^{S} \frac{ |f_{A1} - f_{A2}| + |f_{C1} - f_{C2}| + |f_{G1} - f_{G2}| + |f_{T1} - f_{T2}| }{2} \right) / t
\]
where:
S = the total number of polymorphic sites compared.
fA1 and fA2 = the frequency of A nucleotides at site i in population 1 and 2,
respectively.
fC1 and fC2 = the frequency of C nucleotides at site i in population 1 and 2,
respectively.
fG1 and fG2 = the frequency of G nucleotides at site i in population 1 and 2,
respectively.
fT1 and fT2 = the frequency of T nucleotides at site i in population 1 and 2,
respectively.
t = the total number of nucleotides compared.
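The summation above can be sketched in Python as follows. This is an illustration under stated assumptions, not the MTDIS source: the function names are hypothetical, t is taken to be the number of polymorphic sites plus the number of monomorphic sites, and any character other than A, C, G or T (for example "-" or "N") simply contributes nothing to the four nucleotide frequencies at that site.

```python
BASES = "ACGT"


def site_frequencies(haplotypes, hap_freqs, site):
    """Frequency of each nucleotide at one site within one population."""
    freqs = {base: 0.0 for base in BASES}
    for seq, freq in zip(haplotypes, hap_freqs):
        base = seq[site].upper()
        if base in freqs:  # characters other than A/C/G/T are not counted
            freqs[base] += freq
    return freqs


def sequence_distance(haplotypes, freqs_pop1, freqs_pop2, n_monomorphic):
    """Sum of halved nucleotide frequency differences over S sites, divided by t.

    haplotypes    -- polymorphic-site strings, one per haplotype
    freqs_pop1/2  -- haplotype frequencies in each population, same order
    n_monomorphic -- number of monomorphic sites (assumed to be included in t)
    """
    n_polymorphic = len(haplotypes[0])
    t = n_polymorphic + n_monomorphic
    total = 0.0
    for i in range(n_polymorphic):
        f1 = site_frequencies(haplotypes, freqs_pop1, i)
        f2 = site_frequencies(haplotypes, freqs_pop2, i)
        total += sum(abs(f1[b] - f2[b]) for b in BASES) / 2
    return total / t


if __name__ == "__main__":
    haps = ["ACCGAG", "GTTGAA", "ATCAGG"]
    # Population 1 fixed for haplotype 1, population 2 fixed for haplotype 2.
    print(sequence_distance(haps, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 94))
```

When each population is fixed for a single haplotype, the value reduces to the p-distance between those two haplotypes over all t sites.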
1b.ii) INDELS (Insertion/deletion sequence elements):
MTDIS will also accommodate the analysis of INDELS in your data set. However,
this is optional, as the inclusion or removal of INDEL sequences may be specified in the program
SEQFLTR during the construction of your input files for MTDIS.
If you choose to include the analysis of INDELS then all INDEL stretches will be scored as 1
mutational step change from a sequence which does not possess the INDEL regardless of length.
However, if you know that there are tandem repeat elements in the INDEL stretches then you
may also score these differences according to the number of tandem repeats detected. You will need
to answer -YES- to a specific query regarding this within the program. MTDIS will keep track
of the number of missing sites within an INDEL and simply score the number of mutational steps
for these regions as:
bp / tr
where: bp = the number of missing bp in the INDEL, and
tr = the number of bp in the tandem repeat unit
For example if the sequence AAAAACGACGGCTTTTT formed the following repeating
stretches in four haplotypes:
AGCAAAAAACGACGGCTTTTTAAAAACGACGGCTTTTTAAAAACGACGG
CTTTTTCGATACTATATCCA
AGTA-----------------AAAAACGACGGCTTTTTAAAAAGGAGGG
CTTTTTCGGTACCACACCCA
GGCAAAAAACGACGGCTTTTTAAAAACCACGGCTTTTTAAAAACGACGG
CTTTTTCGATACTGTGTCCG
AACG--------------------------------------------------GAACGTTGTGTTTG
*
* *
Haplotypes 1 and 3 would differ from haplotypes 2 and 4, by 1 and 3 mutational steps,
respectively for the INDEL regions, while haplotype 4 would differ from haplotype 2 by two
mutational steps within this region. Of course if you do not feel that tandem repeats
correspond to a stepwise mutation model you may simply score all INDELS as a single mutational
step regardless of length.
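The two INDEL scoring options can be sketched in Python as follows. The function indel_steps is a hypothetical name, not part of MTDIS. The sketch uses the 17 bp repeat unit AAAAACGACGGCTTTTT from the example and assumes that the long deletion in haplotype 4 spans three such units (51 bp), in line with the three mutational steps quoted in the text.

```python
def indel_steps(missing_bp, repeat_unit_bp=None):
    """Number of mutational steps scored for one INDEL.

    missing_bp     -- number of missing bp in the INDEL
    repeat_unit_bp -- length of the tandem repeat unit, or None to score
                      the whole INDEL as a single step regardless of length
    """
    if missing_bp <= 0:
        return 0
    if repeat_unit_bp is None:
        return 1
    return missing_bp / repeat_unit_bp


if __name__ == "__main__":
    print(indel_steps(17, 17))  # one repeat unit missing
    print(indel_steps(51, 17))  # three repeat units missing
    print(indel_steps(51))      # stepwise model not assumed
```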
The asterisks in the figure above indicate regions where there are polymorphic nucleotide sites in
other haplotypes that overlap with an INDEL region. MTDIS handles the scoring of these
variants in a different fashion.
For example in the following four haplotypes:
AGCAATATTACGGGCAATACGAAGTACAGATTAGACTAGGAAATACTAT
ACTATACGATACTATATCCA
AGTA-------------------------------------------------ACGGTACCACACCCA
GGCAATATTACGAGCAATACGAAGTACAGATTAGACTAGGAAATACTAT
ACTACACGATACTGTGTCCG
AACGATATTACGGGTAATGCGAAGTATAGATTAGACTAGGAAATACTAT
ACTATGGAACGTTGTGTTTG
* *
*
*
Sequence 3 differs from sequence 1 and 4 within the INDEL stretch at the first asterisk, while
sequence 4 differs from sequence 1 and 3 at asterisks 2-4 within the INDEL. In sequence
comparisons that do not include haplotype 2, divergence estimates will be calculated in a normal
fashion. However, distance estimates with haplotype 2 would be averaged for the variant sites
excluding a frequency estimate contributed by INDEL sites in haplotype 2 (i.e. frequencies for
the four variant nucleotides within an INDEL region will be scored as 0). Since nucleotides
cannot be assigned to these positions, the frequency of haplotypes possessing INDELS cannot
contribute to frequency summations for the nucleotides.
The overall effect is that divergence estimates are reduced by 0.5 in comparisons between
haplotypes where one haplotype possesses an INDEL and the other haplotypes possess a variant
nucleotide site within the INDEL.
If you feel that this is an unacceptable bias, you may delete the INDEL regions from the analysis
using SEQFLTR, as previously mentioned.
________________________________________________________________________
2. POPULATION DESIGNATION AND HAPLOTYPE FREQUENCY INPUT FILES:
Note: All files of this type must have a *.pop extension.
In this data file you will need to construct a listing of the number of individuals possessing
different haplotypes in the populations you have examined.
Note: All data input files will need to reside in the same directory on your floppy disk or hard
disk, as the program will only ask to specify the path for your first input file. (i.e. the program
assumes both input files are in the same path).
Populations are input as rows.
The first value to input on the first line is the number of populations which you have compared in
your study.
The second line of the file should contain the names of the populations in the order in which they
are input in the file. For example, if 5 populations are being compared they may be designated
as follows:
5
"POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
You may input the population names on more than one line. However, no blank
lines should be included and all designations should start in column 1.
Population designations are usually given as abbreviations and are listed in quotations for
alphanumeric designations.
Following input of the population designations it will be necessary to enter the
designations of the haplotypes described in the *.res or *.seq file.
These designations MUST be in the order in which the haplotypes were designated
in the *.res or *.seq file.
For example, if you had three haplotypes designated A, B, & D in the *.res file and you
input these haplotypes as A=row 1; D=row 2; and B=row 3 in the *.res file then you must
enter:
"A" "D" "B"
on the 3rd line of the *.pop file describing the haplotypes.
For the example given above, if the three haplotypes are simply designated as:
1, 2, and 3, and were entered in that order, your file should look like this:
5
"POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
"1" "2" "3"
Next you will need to input the population frequency data.
Remember: each line will represent a different population and populations MUST be entered in
the order they are specified in the second line of the file.
i.e. population "POP-1" must be entered first.
The first number input on the line will be the number of clones or haplotypes found in that
population.
Following this designation will be the haplotype designation (in alphanumeric quotations)
followed by the number of individuals sampled possessing that haplotype.
For example, if in "POP-1" three different haplotypes were found with 2 individuals possessing
haplotype 1, 10 individuals possessing haplotype 2, and 5 individuals possessing haplotype 3 the
following should be entered:
5
"POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
"1" "2" "3"
3 "1" 2 "2" 10 "3" 5
If the entire data file looked like this:
5
"POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
"1" "2" "3"
3 "1" 2 "2" 10 "3" 5
1 "1" 12
2 "1" 6 "2" 18
3 "1" 5 "2" 1 "3" 6
2 "2" 3 "3" 12
"DONE"
'POP-1 = Blackwater Cr.
'POP-2 = Brownwater Cr.
'POP-3 = Clearwater Cr.
'POP-4 = Teawater Cr.
'POP-5 = Fish Cr.
This would indicate that in Population 5, three individuals were sampled which had haplotype 2,
while twelve individuals were sampled which had haplotype 3.
Remember to write "DONE" at the end of your input file.
It is also a good idea to insert a number of lines at the end of the data file indicating the
population name that corresponds to each row for the abbreviations given in the main file.
Remember to start these comment lines with a REM statement or '.
The program will then convert the observed number of individuals possessing each
haplotype into population frequencies.
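The conversion from observed counts to population frequencies can be sketched in Python as follows. This is an illustration only, not the MTDIS source: parse_pop is a hypothetical name, and because the *.pop file does not state the number of haplotypes, the sketch takes that number as an argument (in MTDIS it comes from the *.res or *.seq file).

```python
import shlex


def parse_pop(text, n_haplotypes):
    """Read *.pop text and return {population: {haplotype: frequency}}."""
    tokens = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # REM lines and lines starting with an apostrophe are remarks.
        if stripped.upper().startswith("REM") or stripped.startswith("'"):
            continue
        if stripped.strip('"') == "DONE":
            break
        # shlex removes the quotation marks around alphanumeric designations.
        tokens.extend(shlex.split(stripped))

    stream = iter(tokens)
    n_pops = int(next(stream))
    populations = [next(stream) for _ in range(n_pops)]
    haplotypes = [next(stream) for _ in range(n_haplotypes)]

    frequencies = {}
    for pop in populations:
        n_present = int(next(stream))  # haplotypes found in this population
        counts = {hap: 0 for hap in haplotypes}
        for _ in range(n_present):
            name = next(stream)
            counts[name] = int(next(stream))
        total = sum(counts.values())
        frequencies[pop] = {hap: c / total for hap, c in counts.items()}
    return frequencies


if __name__ == "__main__":
    sample = '''5
"POP-1" "POP-2" "POP-3" "POP-4" "POP-5"
"1" "2" "3"
3 "1" 2 "2" 10 "3" 5
1 "1" 12
2 "1" 6 "2" 18
3 "1" 5 "2" 1 "3" 6
2 "2" 3 "3" 12
"DONE"
'POP-1 = Blackwater Cr.
'''
    for pop, freqs in parse_pop(sample, 3).items():
        print(pop, freqs)
```

Reading the file as one stream of tokens means the population names may be spread over more than one line, as the format allows.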
________________________________________________________________________
3. BOOTSTRAPPING FILES:
To produce a number of randomized data sets (bootstrap, half-jackknife) using SEQBOOT in
PHYLIP you will need to modify your input data sets into a format that may be read by
SEQBOOT. These formats are essentially identical for both programs. The main difference is
that you will need to edit the data files used for MTDIS by adding the name of the haplotype
at the beginning of each line.
The first 10 columns of each line may only be occupied by the haplotype designation. Data
entry starts at column 11. (See the README file in PHYLIP for further information on how to
run SEQBOOT). The output file generated from SEQBOOT may be read directly by MTDIS
with the following modifications.
3a) Restriction site data:
The first line of the output file will list the number of haplotypes followed by the number of
polymorphic sites being compared. This line should not be modified.
On the second line of the output file indicate the number of randomized data sets that are
contained in the file. Thus if 100 bootstrapping matrices were generated then type 100 on the
second line.
On the third line list the number of nucleotide bp that are detected by each restriction enzyme, in
the same sequence that they were entered in the data matrix (i.e. left to right).
On the fourth line list the number of monomorphic sites detected with 4-base cutters.
On the fifth line list the number of monomorphic sites detected with 5-base cutters.
On the sixth line list the number of monomorphic sites detected with 6-base cutters.
On the seventh line list the number of monomorphic sites detected with 8-base cutters.
The input order listed above is identical to the format that would be used to specify input for
*.res files.
On line number 8 type the word "READY" in quotation marks. This signals MTDIS that the
output file from SEQBOOT has been modified to provide all the data to MTDIS for the
calculation of genetic distances.
An example on how to modify the beginning lines in the OUTFILE from PHYLIP for
the above data set after 100 bootstrapping replicates would be as follows:
3 6
100
6 6 6 6 6 6
0
0
15
0
"READY"
Note: Remember to rename your file with a *.res extension, otherwise MTDIS will not
recognize it as an input file.
Also, the haplotype names (listed in column 1-10) generated by PHYLIP will be expected by
MTDIS when it processes bootstrapping files so these names should not be removed from the
PHYLIP OUTFILE.
3b) DNA sequence data:
On the second line of the file insert the number of randomization data sets present in the file.
The first line of the output file will already list the number of haplotypes and polymorphic sites
being compared. This line should not be adjusted.
On the third line list the number of monomorphic bp present among all the haplotypes being
compared.
On the fourth line type the word "READY" in quotation marks. This signals MTDIS that the
output file from SEQBOOT has been modified to provide all the data to MTDIS for the
calculation of genetic distances.
An example for the above data set would look like this:
3 6
100
94
"READY"
Note: Remember to rename your file with a *.seq extension, otherwise MTDIS will not
recognize it as an input file.
Also, the haplotype names (listed in column 1-10) generated by PHYLIP will be expected by
MTDIS when it processes bootstrapping files so these names should not be removed from the
PHYLIP OUTFILE.
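The header edits for a bootstrapped DNA sequence file can be sketched in Python as follows. The function prepare_seq_bootstrap is a hypothetical name, not part of MTDIS or PHYLIP. The sketch assumes that the three new lines are inserted after the first line of the SEQBOOT output and that nothing else in the file is changed, so the haplotype names in columns 1-10 stay in place.

```python
def prepare_seq_bootstrap(outfile_text, n_replicates, n_monomorphic):
    """Insert the MTDIS header lines after line 1 of a SEQBOOT output file.

    Line 1 (haplotypes and polymorphic sites) is left untouched; the
    replicate count, the monomorphic bp count and "READY" follow it.
    """
    lines = outfile_text.splitlines()
    if not lines:
        raise ValueError("the SEQBOOT output is empty")
    header = [str(n_replicates), str(n_monomorphic), '"READY"']
    return "\n".join(lines[:1] + header + lines[1:]) + "\n"


if __name__ == "__main__":
    # A shortened stand-in for SEQBOOT output: names occupy columns 1-10.
    seqboot_output = "    3    6\nhap1      ACCGAG\nhap2      GTTGAA\nhap3      ATCAGG\n"
    print(prepare_seq_bootstrap(seqboot_output, 100, 94), end="")
```

The result still has to be saved with a *.seq extension before MTDIS will accept it.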
3b.i) Bootstrapping INDEL regions:
The output produced by SEQFLTR will contain complete INDEL stretches if this option is
specified. If you enter these regions into SEQBOOT then multiple INDEL sites would be
randomly generated. This would greatly inflate the distance estimates obtained for haplotypes
possessing such INDELs. My recommendation is that you text edit your sequences and remove
all INDEL deletion sites (i.e. "-") except 1 for each INDEL unit. In the case of overlapping
polymorphic sites, one possible solution is to keep the deletion
signature (i.e. "-") at the corresponding position in the haplotype with the deletion. This will of
course result in the generation of greater than the expected number of deletions within this
haplotype following the bootstrapping procedure. One way to compensate for this is to
arbitrarily reduce the number of deletion sites by the number of additions in the sequence. For
example, if you have 1 deletion region in a haplotype and added two more sites to correspond to
polymorphic sites in other haplotypes, then you could remove 2 deletion sites from all
bootstrapped data sets. If you only have 1 or 2 sites generated the job is easy. The question
becomes: which 2 sites do I remove if 3 or more deletion sites are generated? I'll leave that one
to you.
________________________________________________________________________
4. OUTPUT FILES:
Output from MTDIS is written to disk in ASCII as a right-handed pairwise distance matrix.
Such a matrix may be readily input to other programs such as PHYLIP or MEGA (Kumar et al.
1993) for the construction of UPGMA trees or Neighbor-Joining trees.
Two types of output files will be written to disk depending upon whether you are analyzing a:
a) single data set, or
b) multiple data sets
When converting the output file for input into MEGA it will be necessary to put # symbols at
the beginning of each population designation. The distance matrix itself will not need to be
modified.
For converting the file into PHYLIP format, the population designations at the beginning of the
file will need to be removed and redesignated at the beginning of each comparison row in
columns 1-10. The actual values of the distance matrix must follow in column 11 or greater.
This will require some editing of the output file from MTDIS.
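The PHYLIP conversion can be sketched in Python as follows. The function to_phylip_square is a hypothetical helper, not part of MTDIS or PHYLIP. The sketch assumes that the upper right-hand distances have already been read into a list of rows, and it writes a full square matrix with each population name padded to columns 1-10. The three distances in the usage example are the POP-1, POP-2 and POP-3 values from the sample output in section 4a.

```python
def to_phylip_square(names, upper_rows):
    """Format an upper right-hand distance matrix as a PHYLIP square matrix.

    names      -- population designations, in matrix order
    upper_rows -- row i holds the distances from population i to
                  populations i+1, i+2, ... (the last row may be omitted)
    """
    n = len(names)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper_rows):
        for offset, dist in enumerate(row):
            j = i + 1 + offset
            full[i][j] = full[j][i] = dist  # mirror into the lower triangle

    lines = [f"{n:5d}"]
    for name, row in zip(names, full):
        # Names fill columns 1-10; distances start at column 11.
        lines.append(name[:10].ljust(10) + " ".join(f"{d:.8f}" for d in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        to_phylip_square(
            ["POP-1", "POP-2", "POP-3"],
            [[0.02567694, 0.01038749], [0.01973684]],
        ),
        end="",
    )
```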
Note: All output files have a *.mtd extension.
4a) Single data sets:
The output file generated for the above restriction site data set would look like this:
1. Genetic sequence divergence among populations.
Restriction site data.
Input population file = "name of your input file"
Input restriction site data file = "name of your input file"
Population 1 = POP-1
Population 2 = POP-2
Population 3 = POP-3
Population 4 = POP-4
Population 5 = POP-5

0.02567694 0.01038749 0.01365546 0.01699347
           0.01973684 0.02182540 0.03650793
                      0.01984127 0.02738095
                                 0.01468254
2. Intrapopulation nucleotide diversities (diagonal) and interpopulation nucleotide diversities
(upper right matrix).
0.01904271 0.02715121 0.01646273 0.02452618 0.02200108
           0.00000000 0.01973684 0.02203425 0.03700919
                      0.01029748 0.02623225 0.02925230
                                 0.02306144 0.02097605
                                            0.01142857
3. NUCLEON DIVERSITIES (using Formula 8.5 of Nei, M. 1987).
Population   h           +/- standard error
_________________________________________________
POP-1        .5882353    5.247664E-02
POP-2        0           0
POP-3        .3913043    4.876184E-02
POP-4        .6212121    5.387287E-02
POP-5        .3428571    6.982974E-02
In the first section of the output file, the population names are listed in the order in which they
were entered in the *.pop file. The genetic distances given in the right-hand matrix are
the average population sequence divergence estimates from the algorithm of Danzmann et al.
(1991). Each value corresponds to the paired sets of populations listed at the top of the file in
the sequential order given. For example, row 1 contains 4 values
representing the sequential pairs of populations tested with POP-1 (i.e. POP-1 & POP-2;
POP-1 & POP-3; POP-1 & POP-4; and POP-1 & POP-5). In a similar fashion, row 2 would
contain all the pairwise distances with POP-2, etc.
Intrapopulation nucleotide diversities are calculated using a formula equivalent to 10.19 of
Nei (1987) = dx, based upon empirical numbers of individuals sampled for each haplotype,
instead of frequencies. Interpopulation nucleotide diversities are estimated in a similar fashion
and are equivalent to dxy using formula 10.20 of Nei (1987). Nei recommends calculating
interpopulation distances as dA = dxy-((dx + dy)/2). However, the values given in the output
table for MTDIS are simply presented as: dxy.
The ordering of the population pairs is similar to that for the top right-hand matrix, with the
exception that numbers on the diagonal represent intra-population nucleotide diversities. Thus,
row 1 has 5 numbers, with the second value representing the interpopulation nucleotide diversity
between POP-1 & POP-2, etc.
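Since MTDIS reports dxy rather than dA, the correction recommended by Nei can be applied to the output afterwards. The Python sketch below uses a hypothetical function name, net_distance, and takes its example values from the sample output: the first value on the diagonal as dx for POP-1, the second value in row 1 as dxy for POP-1 & POP-2, and the 0.00000000 entry as dy for POP-2 (assumed here to be the POP-2 diagonal, since only one haplotype was sampled in that population).

```python
def net_distance(dxy, dx, dy):
    """Net interpopulation distance dA = dxy - (dx + dy) / 2."""
    return dxy - (dx + dy) / 2


if __name__ == "__main__":
    dx_pop1 = 0.01904271   # intrapopulation diversity, POP-1
    dy_pop2 = 0.00000000   # intrapopulation diversity, POP-2
    dxy = 0.02715121       # interpopulation diversity, POP-1 & POP-2
    print(net_distance(dxy, dx_pop1, dy_pop2))
```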
Nucleon diversities are calculated according to formula 8.5 of Nei (1987) and the standard errors
calculated as the square root of the variance for single copy DNA given by formula 8.12 of
Nei (1987).
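The nucleon diversity estimate can be sketched in Python as follows, assuming the usual sample-size-corrected form h = n (1 - sum of squared haplotype frequencies) / (n - 1). The function name nucleon_diversity is hypothetical and the standard error is not computed here. Applied to the haplotype counts of the example *.pop file, this form agrees with the h values listed in the sample output.

```python
def nucleon_diversity(counts):
    """Nucleon (haplotype) diversity from the haplotype counts of one population.

    counts -- number of individuals possessing each haplotype
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two individuals are required")
    sum_squared = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1 - sum_squared)


if __name__ == "__main__":
    example_counts = {
        "POP-1": [2, 10, 5],
        "POP-2": [12],
        "POP-3": [6, 18],
        "POP-4": [5, 1, 6],
        "POP-5": [3, 12],
    }
    for pop, counts in example_counts.items():
        print(pop, round(nucleon_diversity(counts), 7))
```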
Note: the determination of nucleotide diversities and nucleon diversities is only given if a single
data set is entered. If multiple data sets are being input the program assumes that these are
randomized input data sets.
Output generated from a single analysis using DNA sequence data would have an identical
format.
4a.i) Analysis of mtDNA haplotype lineages:
Note: MTDIS may also be used to calculate interhaplotype distances by modifying
the input *.pop file such that each row represents a different haplotype, with only 1 haplotype
being present in each row. The sample size must be greater than 1, however, for each haplotype,
and should be identical for each row (= haplotype).
Thus, if you were comparing 10 haplotypes, called C1 - C10 your input *.pop file could look like
this:
10
"C1" "C2" "C3" "C4" "C5" "C6" "C7" "C8" "C9" "C10"
"C1" "C2" "C3" "C4" "C5" "C6" "C7" "C8" "C9" "C10"
1 "C1" 3
1 "C2" 3
1 "C3" 3
1 "C4" 3
1 "C5" 3
1 "C6" 3
1 "C7" 3
1 "C8" 3
1 "C9" 3
1 "C10" 3
"DONE"
The second and third rows would need to be repeated since the top row represents
the haplotypes being compared (replacing the populations) and all haplotypes need to be
specified in the *.pop file.
4b) Multiple data sets:
The randomized data sets (e.g. bootstrapping matrices) produced by SEQBOOT
are analyzed by MTDIS to produce a sequential set of interpopulation
distances presented as upper right-hand matrices. These data sets are in the format that may be
read by the PHYLIP program NEIGHBOR to construct multiple UPGMA or Neighbor-Joining
trees. The treefile output from NEIGHBOR may then be read by CONSENSE of PHYLIP to
produce a majority-rule consensus tree of the randomized data sets. In other words, MTDIS is
used to generate the distance matrices, but all other aspects of performing the bootstrapping tests
are done using the programs SEQBOOT, NEIGHBOR, and CONSENSE in PHYLIP. Users are
referred to the *.DOC files in PHYLIP for further instructions on implementing these specific
programs.
________________________________________________________________________
5. COMPLETE DNA SEQUENCE INPUT FILES:
I have provided an additional program with this package called SEQFLTR.EXE. This
stand alone executable allows you to read in the complete DNA sequence of several haplotypes.
Software packages that perform sequence alignment procedures will print out left-justified
sequences that may be modified with a minimum of effort for input into MTDIS.
All data files read into SEQFLTR must have the *.inf extension.
DNA sequences may be input in either the SEQUENTIAL or INTERLEAVED format.
Sequential input is structured such that all the DNA sequence from one haplotype is input before
the data from the next haplotype is entered. Interleaved input is structured such that the DNA
sequence from one haplotype is entered on one line followed by the sequence from the next
haplotype on the following line and so on. When the sequences from all haplotypes have been
entered a line space is usually inserted, and the next set of data from each haplotype is entered.
This data entry format is repeated in 1 line data blocks for all haplotypes until the complete
sequence is entered.
Note: all identical sites across haplotypes must be positioned in the same data entry column in
the input file, otherwise a data entry error will occur. In other words, you may not have 61 bp in
one row for a given haplotype, and 60 bp in the next row for the second haplotype.
Spaces are permitted between nucleotide designations, but you may NOT have more
than ONE space between base pairs. However, more than one line space (or no line spaces) may
be left between haplotype inputs.
You will be asked by SEQFLTR whether your data is in a sequential or interleaved
format.
An example of Sequential vs. Interleaved input formats is given as follows:
Sequential:
3 80 0
ACCGAGGGGG GACTTGGAAA CAGACTCAGT ACCTCCCAGA CGGACTATTA CGATCACGTT
ACCCTGAGGA ACTGACCCTC
GTCGAA---- GACTTGGAAA C?GACTCAGT ACCTCCCAGA CGGACTATTA CGATCACGTT
ACCCTGAGGA ACTGACCCTC
ATCGAGGGGG TACTTCCTAA CAGACACAGT ACCTCCCAGA CGGACTATTA CTATCACGTT
ACCCTGAGGA ACTGATTTTC
Interleaved:
3 80 0
ACCGAGGGGG GACTTGGAAA
GTCGAA---- GACTTGGAAA
ATCGAGGGGG TACTTCCTAA
CAGACTCAGT ACCTCCCAGA
C?GACTCAGT ACCTCCCAGA
CAGACACAGT ACCTCCCAGA
CGGACTATTA CGATCACGTT
CGGACTATTA CGATCACGTT
CGGACTATTA CTATCACGTT
ACCCTGAGGA ACTGACCCTC
ACCCTGAGGA ACTGACCCTC
ACCCTGAGGA ACTGATTTTC
You will note that there are three numeric inputs required on the first line of the data input file.
The first value indicates the number of haplotypes being compared. In the above example, 3
haplotypes are being compared. The second value lists the number of base pairs compared.
The final value indicates whether you wish to keep track of the haplotype designations in the file.
The zero value in the current example indicates that you do not wish to give these haplotypes
designations (perhaps they are just called haplotype 1, 2, and 3) or you have already designated
them in the *.pop file.
However, if you wish to designate the haplotypes then enter the value 3 for this input variable
and SEQFLTR will look for three haplotype designations at the end of the file.
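The relationship between the two layouts can be sketched in Python as follows. The function interleaved_to_sequential is a hypothetical name and is not part of SEQFLTR. The sketch assumes an interleaved file whose third header value is 0 (no haplotype names at the end of the file) and in which each haplotype contributes exactly one line per block, as in the example above.

```python
def interleaved_to_sequential(text):
    """Return one complete sequence string per haplotype."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_haps, n_bp, n_names = (int(v) for v in lines[0].split())
    if n_names != 0:
        raise ValueError("this sketch only handles files without haplotype names")

    sequences = [""] * n_haps
    # Data line k belongs to haplotype k modulo the number of haplotypes.
    for index, line in enumerate(lines[1:]):
        sequences[index % n_haps] += line.replace(" ", "")

    if any(len(seq) != n_bp for seq in sequences):
        raise ValueError("sequence length does not match the header value")
    return sequences


if __name__ == "__main__":
    interleaved = """3 80 0
ACCGAGGGGG GACTTGGAAA
GTCGAA---- GACTTGGAAA
ATCGAGGGGG TACTTCCTAA
CAGACTCAGT ACCTCCCAGA
C?GACTCAGT ACCTCCCAGA
CAGACACAGT ACCTCCCAGA
CGGACTATTA CGATCACGTT
CGGACTATTA CGATCACGTT
CGGACTATTA CTATCACGTT
ACCCTGAGGA ACTGACCCTC
ACCCTGAGGA ACTGACCCTC
ACCCTGAGGA ACTGATTTTC
"""
    for seq in interleaved_to_sequential(interleaved):
        print(seq)
```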
Many programs such as MEGA and PHYLIP write this information in the left-hand
column beside each haplotype and start data entry in a subsequent column. To accommodate all
these formats is a difficult programming task. Thus, I have taken the easy way out and will
require you to do some text editing. If your data is one of these formats then you will have to
remove all the leading columns and start data entry in column 1 as shown. If you specify a
haplotype name input then these names must be given IMMEDIATELY following the last row of
data entry without any intervening spaces as follows:
3 80 3
ACCGAGGGGG GACTTGGAAA CAGACTCAGT ACCTCCCAGA CGGACTATTA CGATCACGTT
ACCCTGAGGA ACTGACCCTC
GTCGAA---- GACTTGGAAA C?GACTCAGT ACCTCCCAGA CGGACTATTA CGATCACGTT
ACCCTGAGGA ACTGACCCTC
ATCGAGGGGG TACTTCCTAA CAGACACAGT ACCTCCCAGA CGGACTATTA CTATCACGTT
ACCCTGAGGA ACTGATTTTC
Haplotype1
Haplotype2
Haplotype3
If INDELS are present in your DNA sequences you may tell SEQFLTR to either keep these
sections in the output *.seq file, or remove them entirely.
IMPORTANT NOTE: The order of the haplotypes in your *.inf file must be the same
order you designate the haplotypes in the *.pop file.
The output file produced by SEQFLTR will have the *.res extension and will be in a proper
format for input into MTDIS (i.e. only the polymorphic sites across haplotypes will be listed,
followed by the number of monomorphic sites). In addition, SEQFLTR will give a listing of
where the polymorphic sites occur within the original sequences.
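The reduction described above (keep only the polymorphic sites, count the monomorphic ones, and report where the polymorphic sites fall) can be sketched in Python as follows. This is an illustration under stated assumptions, not SEQFLTR's actual code: the name split_sites is hypothetical, a site is treated as polymorphic whenever more than one distinct character occurs in its column, and the drop_indels option simply discards any column containing "-". How SEQFLTR itself scores "?" and "-" characters is not stated in this manual.

```python
def split_sites(sequences, drop_indels=False):
    """Separate polymorphic from monomorphic sites in aligned sequences.

    Returns (reduced_sequences, polymorphic_positions, monomorphic_count).
    Positions are 1-based and refer to the original alignment.
    """
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("all sequences must have the same length")

    polymorphic_positions = []
    monomorphic_count = 0
    for pos, column in enumerate(zip(*sequences), start=1):
        # Assumption: an indel column is any column that contains '-'.
        if drop_indels and "-" in column:
            continue
        if len(set(column)) > 1:
            polymorphic_positions.append(pos)
        else:
            monomorphic_count += 1

    # Keep only the polymorphic columns, one string per haplotype.
    reduced = [
        "".join(seq[p - 1] for p in polymorphic_positions)
        for seq in sequences
    ]
    return reduced, polymorphic_positions, monomorphic_count


if __name__ == "__main__":
    toy = ["ACGT-A", "ACGA-A", "TCGT-A"]
    print(split_sites(toy))
    print(split_sites(toy, drop_indels=True))
```

In the toy alignment, columns 1 and 4 vary among the haplotypes, so only those two columns are kept; the all-gap column counts as monomorphic unless drop_indels is set.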
5a) Array size:
The MS-DOS version of VISUAL BASIC with which SEQFLTR was compiled limits array sizes
to approximately 32,000 elements. Therefore, you must keep this fact in
mind when you are converting DNA sequences for input into MTDIS. If you are analyzing more
than this number you will have to truncate the bp range accordingly. This is not a major
problem. Since the MTDIS input file may be structured with only polymorphic sites, the total
number of bp screened could potentially be quite large. You simply have to
append the output from several different input data sets. Append (any text editor will do this)
the output from each analysis in a sequential fashion (i.e. the polymorphic sites from haplotype 1
must be listed first, followed by haplotype 2, etc.). Remember, no spaces are allowed between bp,
but multiple lines of input are allowed. Therefore, simply place the output from file 1 in the first
row, followed by the output from file 2 in the next row, and finally the output from additional
files in additional rows. Then go on to the next haplotype. Do not leave any spaces between
haplotype entries, and remember to add together the number of monomorphic bp detected from
each input file for entry at the bottom of your MTDIS input file.
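The appending procedure just described can also be scripted rather than done by hand. The Python sketch below assumes that each partial analysis has already been reduced to a list of polymorphic-site strings (one per haplotype, in the same haplotype order) plus a monomorphic count; the function name merge_chunks and this in-memory representation are hypothetical and are not a format read or written by SEQFLTR.

```python
def merge_chunks(chunks):
    """Combine the results of several partial analyses.

    chunks is a list of (reduced_sequences, monomorphic_count) tuples,
    given in file order. For each haplotype the rows from file 1, file 2,
    ... are collected in that order, and the monomorphic counts are summed.
    """
    if not chunks:
        raise ValueError("no chunks supplied")

    n_hap = len(chunks[0][0])
    merged_rows = [[] for _ in range(n_hap)]
    total_monomorphic = 0

    for reduced, monomorphic in chunks:
        if len(reduced) != n_hap:
            raise ValueError("every chunk must list the same haplotypes")
        for i, sites in enumerate(reduced):
            merged_rows[i].append(sites)  # one row per source file
        total_monomorphic += monomorphic

    return merged_rows, total_monomorphic


if __name__ == "__main__":
    file1 = (["AT", "AA"], 10)  # polymorphic sites and monomorphic bp
    file2 = (["G", "C"], 5)
    rows, total = merge_chunks([file1, file2])
    for haplotype_rows in rows:
        print("\n".join(haplotype_rows))
    print(total)
```

Each haplotype's rows are printed together before the next haplotype begins, and the summed monomorphic count is the value to enter at the bottom of the MTDIS input file.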
________________________________________________________________________
If you have problems with setting up the data files or implementing the program
please do not hesitate to contact me directly by E-mail at:
rdanzman@uoguelph.ca
Roy Danzmann,
Department of Zoology,
University of Guelph,
Guelph, Ontario, Canada
N1G 2W1
dated: June 1, 1999.
These programs may be obtained by anonymous ftp to site:
131.104.50.2
Use the password: danzmann
Then use the command GET to retrieve the file:
MTDIS.ZIP
PKUNZIP this file. It contains the following files.
________________________________________________________________________
Files provided with MTDIS.
1. MTDIS.EXE
MS-DOS stand alone executable for estimating average
sequence divergence among populations.
2. SEQFLTR.EXE
MS-DOS stand alone executable for converting complete
DNA sequences into the correct input format for MTDIS.
3. example.res
An example file showing the proper format for the input
restriction site data.
4. example.seq
An example file showing the proper format for the input
DNA sequence data.
5. example.pop
An example file showing the proper format for the input
population data.
6. bootexam.res
bootexam.seq
Example files showing how to convert the restriction
site and DNA sequence input files into a format that
may be read by SEQBOOT in PHYLIP.
7. bootfile.res
bootfile.seq
Example files showing how to convert the randomized
OUTFILEs from SEQBOOT in PHYLIP into a format that
may be read by MTDIS.
8. examres.mtd
The output file generated from the example.res and
example.pop data files.
9. examseq.mtd
The output file generated from the example.seq and
example.pop data files.
10. bootres.mtd
The output file generated from the bootfile.res and
example.pop data files. This file may be read directly
by NEIGHBOR in PHYLIP.
11. bootseq.mtd
The output file generated from the bootfile.seq and
example.pop data files. This file may be read directly
by NEIGHBOR in PHYLIP.
12. examres.meg
examseq.meg
Example files showing how to convert the output
examres.mtd and examseq.mtd files into a format that
may be read by the program MEGA (Kumar et al. 1993).
Note: Remember to insert a line between the population
designations and the first line of the distance matrix.
13. readme.now
An ASCII text file with instructions on how to use
these programs.
14. readme.doc
A WORD file with instructions on how to use these
programs.
________________________________________________________________________
References:
Danzmann, R.G., M.M. Ferguson, S. Skulason, S.S. Snorrason, and D.L.G. Noakes.
1991. Mitochondrial DNA diversity among four sympatric morphs of Arctic
charr, Salvelinus alpinus L., from Thingvallavatn, Iceland. J. Fish Biol. 39:
649-659.
Felsenstein, J. 1993. Phylogeny Inference Package (PHYLIP). Version 3.5. Department
of Genetics, University of Washington, Seattle.
Kumar, S., K. Tamura, and M. Nei. 1993. MEGA: Molecular Evolutionary Genetics
Analysis. ver. 1.01. The Pennsylvania State University, University Park, PA
16802.
Nei, M. 1987. Molecular Evolutionary Genetics. Columbia University Press, New
York.
________________________________________________________________________
Citation:
Danzmann, R.G. 1998. MTDIS: A computer program to estimate genetic distances among
populations based upon variation in single copy DNA. J. Hered. 89: 283-284.