Phylogenetics

Phylogenetics exercise, Bioinformatics for cell biologists, 2011
This exercise will simulate the construction of a phylogenetic tree for organisms where only individual
genes (rather than whole genomes) are sequenced, akin to the construction of 16S rRNA-based trees for
bacteria. The same tools can also be used to find how paralogs are related.
1. Select a gene (e.g. CYTB) and find its DNA sequence for different mammalian species (somewhere
between 5 and 10 species). Entrez Gene is one of the databases that has such sequence information; look for the link
labeled FASTA. Put the sequences in a single FASTA file (use Notepad or similar, see
http://www.bioinformatics.nl/tools/crab_fasta.html for an explanation of the format). Name the
sequences by abbreviations of the species names (anything over 10 letters will be truncated).
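
As a rough illustration of the file layout, the Python sketch below builds the text of a multi-sequence FASTA file and cuts each name to 10 characters, mirroring the truncation mentioned above. The function name to_fasta, the species labels and the sequences are all made up for this example (the sequences are placeholders, not real CYTB data).

```python
def to_fasta(records, name_width=10, line_width=60):
    """Return FASTA text for (name, sequence) pairs.

    Names are cut to name_width characters and must stay unique afterwards.
    """
    lines = []
    seen = set()
    for name, seq in records:
        short = name.replace(" ", "_")[:name_width]
        if short in seen:
            raise ValueError(f"name {short!r} is not unique after truncation")
        seen.add(short)
        lines.append(">" + short)
        # Wrap the sequence so every line holds at most line_width letters.
        for start in range(0, len(seq), line_width):
            lines.append(seq[start:start + line_width])
    return "\n".join(lines) + "\n"


# Placeholder records: hypothetical labels and invented sequences.
records = [
    ("Homo sapiens", "ATGACCCCAATACGCAAA"),
    ("Mus musculus", "ATGACAAACATACGAAAA"),
]
fasta_text = to_fasta(records)
print(fasta_text)
```

The check for duplicate names is there because two long species names can collapse to the same 10 letters, which would make the leaves of the tree impossible to tell apart.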
2. Log into the server 130.237.142.51. You will need both an SSH client (PuTTY) for running programs,
and an SCP client (WinSCP) for transferring files.
SSH will give you access to a Unix/Linux command line. Some useful commands:
cd folder (to change folder; cd .. to go up one level)
ls (shows the contents of the current folder)
mv source destination (for renaming a file)
cp source destination (copies a file)
rm filename (deletes a file; rm -r deletes a folder)
less filename (for reading a text file; q to exit, f and b to scroll)
mkdir folder (makes a new folder)
keys: ctrl+C (shuts down the running program), tab (auto-completes file name), arrow up (gives last
command)
3a. Run a multiple sequence alignment using Muscle.
Run: muscle -in in.fa -phyiout alignment (in.fa is your FASTA file from step 1)
3b. Build a tree using maximum likelihood
Rename (or rather copy) the output file from muscle to 'infile' (cp alignment infile), then run: phylip
dnaml
Dnaml will create the files 'outfile' and 'outtree'.
4. View your tree
Have a glance at the files outfile and outtree, using less or another program for reading text files
outtree is in Newick format, which can be read by the viewer TreeView (as well as several other tree
viewing programs)
Copy outtree to your computer and visualise the tree using TreeView
Note that your tree is unrooted, so at each node, you don't know which one of the three branches it
connects is the ancestor and which two derive from that ancestor. There are good illustrations on rooted
versus unrooted trees at http://www.ncbi.nlm.nih.gov/About/primer/phylo.html (under Tree Building:
Key Features of DNA-based Phylogenetic Trees).
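
For a feel of what outtree contains, here is a small Python sketch that reads a Newick string, lists its leaves, and counts the subtrees attached at the outermost level. The example string is invented (it is not dnaml output), and both function names are made up for this illustration. Three subtrees at the outermost level is how an unrooted tree is usually written in Newick.

```python
import re


def leaf_names(newick):
    """Return the leaf labels of a Newick string, in order of appearance."""
    # A leaf label directly follows '(' or ',' and runs up to ':', ',' or ')'.
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def top_level_children(newick):
    """Count the subtrees attached directly to the outermost node."""
    depth = 0
    children = 1
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            # A comma at depth 1 separates two subtrees of the outermost node.
            children += 1
    return children


# Invented example: four species, branch lengths after each colon.
example = "(human:0.10,chimp:0.12,(mouse:0.30,rat:0.28):0.20);"
print(leaf_names(example))
print(top_level_children(example))
```

This only handles simple trees with unquoted names; a tree viewer such as TreeView does the full parsing for you.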
5. Add a sequence for the same gene from a bird/reptile to use as outgroup. Build a new tree using
maximum likelihood and view it. Open the tree in TreeView and root the tree (Tree -> Define
outgroup..., then Tree -> Root with outgroup).
By adding one or several species that you know are less closely related to any of the species you are
investigating than they are to one another, you get the time direction in the tree, so you have a rooted tree.
6. Build a tree using neighbor joining
Name the output file from muscle 'infile' (if needed), then run phylip dnadist
This will create a distance matrix (you can look at 'outfile' that it produces)
Rename 'outfile' to 'infile' and run phylip neighbor (mv outfile infile)
The file 'outtree' can be viewed in TreeView
Maximum likelihood and neighbor joining are two different algorithms for constructing phylogenetic
trees. There are a few other algorithms available in the phylip package. Neighbor joining is fast and is one
of the most commonly used algorithms for tree construction.
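
To make the distance matrix in this step more concrete, the Python sketch below computes pairwise distances from a toy alignment. It is a simplification, not what dnadist does internally: it uses the plain fraction of differing sites with the Jukes-Cantor correction, whereas dnadist defaults to a more elaborate substitution model (F84). The alignment and the function names are invented for the example.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions that differ, skipping gap positions."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (m, n): jukes_cantor(p_distance(alignment[m], alignment[n]))
        for m, n in combinations(alignment, 2)
    }


# Toy alignment with invented sequences of equal length.
alignment = {
    "human": "ATGACCCCAATACGCAAA",
    "chimp": "ATGACCCCGATACGCAAA",
    "mouse": "ATGACAAACATACGAAAA",
}
dist = distance_matrix(alignment)
for (m, n), d in dist.items():
    print(m, n, round(d, 3))
```

Neighbor joining works only from a matrix like this one, while maximum likelihood goes back to the aligned sequences themselves, which is one reason the two methods can give different trees.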
7. Bootstrap a tree to see how well your data supports it
Name the output file from muscle 'infile'
Run phylip seqboot and make 100 bootstrap randomizations (the default value) of your alignment
(it will ask you for a seed for random number generation; any odd number works)
Then run phylip dnadist and phylip neighbor as in step 6 (renaming 'outfile' to 'infile' before each
program), using the M option followed by D (and 100, for the number of bootstrap samples)
Rename outtree to intree and run phylip consense to count how often the trees agree
View the outtree file that phylip consense created (In TreeView, set Tree -> Show internal edge labels)
A high bootstrap value says that there is enough data, and the data is consistent enough, to support a
node.
Bootstrapping creates pseudo-replicates of the alignment. A bootstrap replicate contains the same
number of nucleotide positions as the original, but they are taken randomly, allowing some positions in
the original to be represented several times while some positions are not represented at all. From each
replicate a tree is constructed. The bootstrap value says how many of the trees created this way contain
a particular node.
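
The resampling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code seqboot or consense runs: the function names, the toy alignment, and the list of groupings per replicate tree are all invented for the example.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps names to aligned sequences of equal length; the
    replicate has the same number of columns as the original.
    """
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("aligned sequences must have equal length")
    # Some column indices will be drawn several times, others not at all.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def support(grouping, replicate_groupings):
    """Fraction of replicate trees that contain the given grouping."""
    hits = sum(grouping in groupings for groupings in replicate_groupings)
    return hits / len(replicate_groupings)


# Toy alignment with invented sequences.
alignment = {
    "human": "ATGACCCCAATACGCAAA",
    "chimp": "ATGACCCCGATACGCAAA",
    "mouse": "ATGACAAACATACGAAAA",
}
replicate = bootstrap_replicate(alignment, random.Random(1))
print(replicate)

# Made-up groupings found in four replicate trees (not real results).
replicate_groupings = [
    {frozenset({"human", "chimp"})},
    {frozenset({"human", "chimp"})},
    {frozenset({"human", "mouse"})},
    {frozenset({"human", "chimp"})},
]
print(support(frozenset({"human", "chimp"}), replicate_groupings))
```

Whole columns are resampled, so every species keeps the same positions within a replicate and the alignment stays consistent across sequences.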