Phylogenetics exercise, Bioinformatics for cell biologists, 2011

This exercise simulates the construction of a phylogenetic tree for organisms where only individual genes (rather than whole genomes) have been sequenced, akin to the construction of 16S rRNA-based trees for bacteria. The same tools can also be used to find out how paralogs are related.

1. Select a gene and collect sequences
Select a gene (e.g. CYTB) and find its DNA sequence for several mammalian species (somewhere between 5 and 10 species). Entrez Gene is one of the databases that has such sequence information; look for the link labeled FASTA.
Put the sequences in a single FASTA file (use Notepad or similar; see http://www.bioinformatics.nl/tools/crab_fasta.html for an explanation of the format).
Name the sequences by abbreviations of the species names (anything over 10 letters will be truncated).

2. Log into the server 130.237.142.51
You will need both an SSH client (PuTTY) for running programs and an SCP client (WinSCP) for transferring files. SSH will give you access to a Unix/Linux command line.
Some useful commands:
    cd folder (changes folder; cd .. goes up one level)
    ls (shows the contents of the current folder)
    mv source destination (moves or renames a file)
    cp source destination (copies a file)
    rm filename (deletes a file; rm -r deletes a folder)
    less filename (for reading a text file; q to exit, f and b to scroll)
    mkdir folder (makes a new folder)
Keys: ctrl+C (shuts down the running program), tab (auto-completes a file name), arrow up (gives the last command)

3. Run multiple sequence alignment using Muscle
Run:
    muscle -in in.fa -phyiout alignment

4. Build a tree using maximum likelihood
Rename (or rather copy) the output file from muscle to 'infile':
    cp alignment infile
Then run:
    phylip dnaml
Dnaml will create the files 'outfile' and 'outtree'.

5. View your tree
Have a glance at the files outfile and outtree, using less or another program for reading text files.
outtree is in Newick format, which can be read by the viewer TreeView (as well as by several other tree viewing programs).
Copy outtree to your computer and visualise the tree using TreeView.
Note that your tree is unrooted, so at each internal node you don't know which of the three branches it connects leads to the ancestor and which two derive from that ancestor. There are good illustrations of rooted versus unrooted trees at http://www.ncbi.nlm.nih.gov/About/primer/phylo.html (under Tree Building: Key Features of DNA-based Phylogenetic Trees).

6. Add an outgroup
Add a sequence for the same gene from a bird or reptile to use as outgroup.
Build a new tree using maximum likelihood and view it.
Open the tree in TreeView and root it (Tree -> Define outgroup..., then Tree -> Root with outgroup).
By adding one or several species that you know are less related to any of the species you are investigating than those species are to one another, you get the time direction in the tree. So you have a rooted tree.

7. Build a tree using neighbor joining
Name the output file from muscle 'infile' (if needed), then run:
    phylip dnadist
This will create a distance matrix (you can look at the 'outfile' that it produces).
Rename 'outfile' to 'infile' and run neighbor:
    mv outfile infile
    phylip neighbor
The file 'outtree' can be viewed in TreeView.
Maximum likelihood and neighbor joining are two different algorithms for constructing phylogenetic trees. There are a few other algorithms available in the phylip package. Neighbor joining is among the most commonly used algorithms for tree construction, largely because it is fast.

8. Bootstrap a tree to see how well your data supports it
Name the output file from muscle 'infile'.
Run phylip seqboot and make 100 bootstrap randomizations (the default value) of your alignment. It will ask you for a seed for random number generation; any odd number works.
Then run phylip dnadist and phylip neighbor. In each, use the M option (choose D for data sets when asked, and enter 100 for the number of bootstrap samples). As in the previous step, rename each program's 'outfile' to 'infile' before running the next program.
Rename outtree to intree and run phylip consense to count how often the trees agree.
View the outtree file that phylip consense created (in TreeView, set Tree -> Show internal edge labels).
A high bootstrap value says that there is enough data, and that the data is consistent enough, to support a node.
Bootstrapping creates pseudo-replicates of the alignment. A bootstrap replicate contains the same number of nucleotide positions as the original, but they are drawn randomly with replacement, so some positions in the original are represented several times while others are not represented at all. A tree is constructed from each replicate. The bootstrap value says how many of the trees created this way contain a particular node.
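
The column resampling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what phylip seqboot actually runs: the function name bootstrap_replicate and the toy sequences are made up for this example, and the alignment is assumed to be held as a dictionary of equal-length strings.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Returns a new dict with the same names and the same alignment length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    # Draw column indices with replacement: some columns will appear
    # several times, others not at all.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same column draw to every sequence so columns stay intact.
    return {name: "".join(seq[i] for i in cols)
            for name, seq in alignment.items()}


# Toy alignment (made-up data, for illustration only).
toy = {"human": "ACGTACGT", "mouse": "ACGTTCGT", "chick": "ACCTTCGA"}

rng = random.Random(1)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(toy, rng) for _ in range(100)]
print(replicates[0])
```

Each of the 100 replicates would then be turned into a tree (for example with dnadist and neighbor), and the bootstrap value of a node is the number of those trees that contain it.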