Gene family analysis - BITS Workshop Comparative Genomics

Prof. Dr. Klaas Vandepoele
Objective: Identify and characterize the members of the vertebrate gene family
encoding TLN2.
Methods
- Sequence retrieval : Ensembl (http://www.ensembl.org) or OrthoMCLDB
(http://www.orthomcl.org/orthomcl/)
- MUSCLE http://www.ebi.ac.uk/Tools/msa/muscle/
- BioEdit Alignment Editor
- JALVIEW Alignment editor http://www.jalview.org/ [Download – Web Start]
- Gblocks http://molevol.cmima.csic.es/castresana/Gblocks_server.html
- Tree construction: MEGA (local PC) or http://atgc.lirmm.fr/phyml/
- Automated Tree construction pipeline: http://www.phylogeny.fr/simple_phylogeny.cgi
Steps in square brackets [ … ] are optional, intended for those who finish early.
Always save your data/files as plain .txt files; edit them using WordPad
Workflow
1. Collect experimental reference gene(s):
1.1. Using Ensembl – Homo sapiens, search for the TLN2 gene. Identify the longest
transcript, select the corresponding protein (ENSP00000303476) and explore the
protein domains using the Protein Summary option (left-hand menu).
1.2. Retrieve the protein sequence via UniProt – Sequence – Fasta, save as
hsapTLN2.fa.
1.3. Use the TLN2 protein to search for homologs in the OrthoMCLDB using BLASTP.
[ As an alternative, you can also search using the Ensembl Protein ID
(ENSP00000303476) ]
1.4. Identify the Orthologous Group ID (OG) containing the TLN2 gene in the
OrthoMCLDB.
2. Construct multiple sequence alignment (MSA)
2.1. Perform New Search > Search for Groups using OG5_129801, and save as simple
text file (OGXXX_proteins.txt)
2.2. Reduce the final dataset to 5 species (include at least 1 outgroup species). For
example, select hsap (human), mmus (mouse), ggal (chicken), drer (zebrafish) and
cele (worm).
2.3. Shorten the accession numbers to at most 9 characters and remove atypical
characters such as * | { }
Why? Many alignment tools cannot handle these characters.
2.4. Create a multiple sequence alignment (MSA) using the MUSCLE tool.
2.5. Select tab Result Summary, save result ‘Alignment in CLUSTAL format’ as file
‘OG5_129801_proteins_msa.txt’
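Steps 2.2 and 2.3 can also be scripted instead of edited by hand. The following is a minimal sketch, assuming that each FASTA header starts with a token of the form species|accession (for example hsap|ENSP00000303476); check this against your own download. The function names and the ID scheme (species code plus the last digits of the accession) are illustrative choices, not part of the workshop material.

```python
import re

# Species codes from step 2.2 (adjust to your own selection).
KEEP_SPECIES = {"hsap", "mmus", "ggal", "drer", "cele"}


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def short_id(species, accession, used, max_len=9):
    """Build an alphanumeric ID of at most max_len characters, unique within `used`."""
    prefix = re.sub(r"[^A-Za-z0-9]", "", species)[:4]
    acc = re.sub(r"[^A-Za-z0-9]", "", accession)
    tail_len = max_len - len(prefix)
    base = (prefix + acc[-tail_len:]) or "seq"
    candidate, n = base, 1
    while candidate in used:  # resolve collisions with a numeric suffix
        suffix = str(n)
        candidate = base[: max_len - len(suffix)] + suffix
        n += 1
    used.add(candidate)
    return candidate


def select_and_clean(text, keep=KEEP_SPECIES):
    """Keep only the chosen species and give each entry a short, clean ID."""
    used, cleaned = set(), []
    for header, seq in parse_fasta(text):
        tokens = header.split()
        parts = tokens[0].split("|") if tokens else [""]
        species = parts[0]
        accession = parts[1] if len(parts) > 1 else ""
        if species not in keep:
            continue
        seq = seq.replace("*", "")  # drop stop-codon symbols
        cleaned.append((short_id(species, accession, used), seq))
    return cleaned


def to_fasta(records):
    """Write (id, sequence) tuples back as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

Usage: read OGXXX_proteins.txt into a string, pass it to select_and_clean, and write the output of to_fasta to a new .txt file for MUSCLE. Keep a note of which short ID maps to which original accession.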
3. Remove non-conserved alignment positions
3.1. Open the MSA file in BioEdit and change Mode to Edit.
3.2. Clean the alignment by retaining only the conserved positions. Save the cleaned
alignment as a FASTA alignment ‘OG5_129801_proteins_msa_edit1.fas’
3.3. [ Alternatively, use the Gblocks server to do this for you (‘Allow smaller final blocks’
and ‘Allow gap positions’). Note that the input file must be completely clean: each entry
is >AC followed by the sequence on the next line, with any additional text after the AC
removed. ]
3.4. Check if partial sequences are present. If so, remove these entries as they do not
contain sufficient phylogenetic information for the tree construction.
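To make the idea behind step 3 concrete, the sketch below trims an alignment with a simple gap-fraction rule and flags likely partial sequences. This is a crude stand-in and not the Gblocks algorithm: it uses gaps as a proxy for poorly conserved positions, and the thresholds (max_gap_fraction, min_coverage) are assumed values chosen for illustration. The alignment is represented as a dictionary mapping sequence names to aligned strings of equal length.

```python
def conserved_columns(alignment, max_gap_fraction=0.0):
    """Return indices of columns whose gap fraction is <= max_gap_fraction."""
    seqs = list(alignment.values())
    if not seqs:
        raise ValueError("alignment is empty")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for i in range(length):
        gaps = sum(1 for s in seqs if s[i] == "-")
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(i)
    return keep


def trim_alignment(alignment, max_gap_fraction=0.0):
    """Return a new alignment that contains only the retained columns."""
    keep = conserved_columns(alignment, max_gap_fraction)
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


def partial_sequences(alignment, min_coverage=0.5):
    """Return names of sequences covering less than min_coverage of the columns."""
    flagged = []
    for name, seq in alignment.items():
        coverage = 1 - seq.count("-") / len(seq)
        if coverage < min_coverage:
            flagged.append(name)
    return flagged
```

With a strict threshold, a single partial sequence can cause most columns to be discarded, so it can help to run partial_sequences on the untrimmed alignment and drop the flagged entries before trimming.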
4. Construct phylogenetic tree
4.1. Open the MEGA program on your PC. First you will need to convert your clean
alignment to the MEG format.
4.2. File – Convert to MEGA: Give the location of the input file and specify that the input
data format is FASTA.
4.3. Save output MEG file and close the Format Converter program
4.4. In the MEGA main program, File – Open data, and select the MEG file you just
created. Indicate this file contains protein sequences.
4.5. In the main window, select Phylogeny – Neighbor-Joining. Settings
4.5.1. Test of Phylogeny: Bootstrap method, using 500 replications
4.5.2. Gaps/Missing data: Pairwise deletion
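The ‘Pairwise deletion’ setting in step 4.5.2 can be illustrated with a short sketch. It computes the proportion of differing sites (p-distance) for each pair of sequences, skipping only those positions where one of the two sequences in that pair has a gap. The p-distance is used here as the simplest possible measure; the substitution model MEGA applies depends on your settings, and this code is not MEGA's implementation.

```python
from itertools import combinations

GAP_CHARS = {"-", "?"}  # gap and missing-data symbols


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, using pairwise deletion of gaps."""
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a in GAP_CHARS or b in GAP_CHARS:
            continue  # skip this site for this pair only
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("the two sequences share no ungapped positions")
    return differences / compared


def distance_matrix(alignment):
    """Return {(name1, name2): distance} for every pair in the alignment."""
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m, n in combinations(alignment, 2)}
```

For the toy alignment {"a": "MKTLV", "b": "MK-LI", "c": "M-TAV"}, the pair a/b is compared over four sites and the pair b/c over three. Under complete deletion, columns 2 and 3 would instead be dropped for every pair, which discards more data when gaps are scattered across the alignment.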
5. Classification of homologs using phylogenetic tree construction
5.1. Determine orthologous gene relationships together with recent/ancient duplication
events.
6. Do you want to see the full tree containing all homologs?
6.1. Starting from step 2.1 (file OGXXX_proteins.txt), generate an MSA of all proteins in
OG5_129801 using MUSCLE, save it as OG5_129801_proteins_msa.txt, open it in
BioEdit and save it as a FASTA alignment file.
6.2. Repeat steps 4 and 5.
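The CLUSTAL-to-FASTA conversion in step 6.1 can also be done with a few lines of code instead of BioEdit. The sketch below assumes a standard CLUSTAL layout: a header line starting with CLUSTAL or MUSCLE, followed by blocks of ‘name sequence’ lines, with conservation lines that begin with whitespace. The function name is illustrative.

```python
def clustal_to_fasta(text, width=60):
    """Convert a CLUSTAL-format alignment to FASTA-format text."""
    sequences = {}  # name -> aligned sequence, in order of first appearance
    lines = text.splitlines()
    if lines and lines[0].startswith(("CLUSTAL", "MUSCLE")):
        lines = lines[1:]  # skip the format header
    for line in lines:
        # Blank lines separate blocks; conservation lines start with spaces.
        if not line.strip() or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        sequences[name] = sequences.get(name, "") + chunk
    out = []
    for name, seq in sequences.items():
        out.append(f">{name}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n" if out else ""
```

Usage: read OG5_129801_proteins_msa.txt into a string, call clustal_to_fasta on it, and save the result as a plain text file before converting it to the MEG format.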
7. [ Submit OG5_129801_proteins_msa.txt to
http://www.phylogeny.fr/simple_phylogeny.cgi and evaluate the tree ]