Gene family analysis - BITS Workshop Comparative Genomics
Prof. Dr. Klaas Vandepoele

Objective
Identify and characterize vertebrate gene family members encoding TLN2.

Methods
- Sequence retrieval: Ensembl (http://www.ensembl.org) or OrthoMCLDB (http://www.orthomcl.org/orthomcl/)
- MUSCLE http://www.ebi.ac.uk/Tools/msa/muscle/
- BioEdit Alignment Editor
- JALVIEW Alignment editor http://www.jalview.org/ [Download – Web Start]
- Gblocks http://molevol.cmima.csic.es/castresana/Gblocks_server.html
- Tree construction: MEGA (local PC) or http://atgc.lirmm.fr/phyml/
- Automated tree construction pipeline: http://www.phylogeny.fr/simple_phylogeny.cgi

Square brackets […] indicate optional steps for the fast people.
Always save your data/files as plain .txt files; edit them using WordPad.

Workflow
1. Collect experimental reference gene(s)
1.1. Using Ensembl – Homo sapiens, search for the TLN2 gene. Identify the longest transcript, select the corresponding protein (ENSP00000303476) and explore the protein domains using the Protein Summary option (menu on the left).
1.2. Retrieve the protein sequence via UniProt – Sequence – Fasta and save it as hsapTLN2.fa.
1.3. Use the TLN2 protein to search for homologs in the OrthoMCLDB using BLASTP. [ As an alternative, you can also search using the Ensembl Protein ID (ENSP00000303476) ]
1.4. Identify the Orthologous Group ID (OG) containing the TLN2 gene in the OrthoMCLDB.

2. Construct multiple sequence alignment (MSA)
2.1. Perform New Search > Search for Groups using OG5_129801, and save the result as a simple text file (OGXXX_proteins.txt).
2.2. Reduce the final dataset to 5 species (include at least 1 outgroup species). For example, select hsap (human), mmus (mouse), ggal (chicken), drer (zebrafish) and cele (worm).
2.3. Shorten the accession numbers to max. 9 characters and remove atypical characters such as * | { }. Why? Many alignment tools cannot handle these characters.
2.4. Create a multiple sequence alignment (MSA) using the MUSCLE tool.
2.5. Select the tab Result Summary and save the result ‘Alignment in CLUSTAL format’ as file ‘OG5_129801_proteins_msa.txt’.

3. Remove non-conserved alignment positions
3.1. Open the MSA file in BioEdit and change Mode to Edit.
3.2. Clean the alignment by retaining only the conserved positions. Save your clean alignment again as FASTA alignment ‘OG5_129801_proteins_msa_edit1.fas’.
3.3. [ Alternatively, use the Gblocks server to do this for you (‘Allow smaller final blocks’ and ‘Allow gap positions’). Note that your input file must be completely clean: a >AC header line followed by the sequence on the next line, with any additional text after the AC removed. ]
3.4. Check whether partial sequences are present. If so, remove these entries, as they do not contain sufficient phylogenetic information for the tree construction.

4. Construct phylogenetic tree
4.1. Open the MEGA program on your PC. First you will need to convert your clean alignment to the MEG format.
4.2. File – Convert to MEGA: give the location of the input file and specify that the input data format is FASTA.
4.3. Save the output MEG file and close the Format Converter program.
4.4. In the MEGA main program, choose File – Open data and select the MEG file you just created. Indicate that this file contains protein sequences.
4.5. In the main window, select Phylogeny – Neighbor-Joining. Settings:
4.5.1. Test of Phylogeny: Bootstrap method, using 500 replications
4.5.2. Gaps/Missing data: Pairwise deletion

5. Classification of homologs using phylogenetic tree construction
5.1. Determine orthologous gene relationships together with recent/ancient duplication events.

6. Want to see the full tree containing all homologs?
6.1. Starting from step 2.1 (file OGXXX_proteins.txt), generate an MSA of all proteins in OG5_129801 using MUSCLE, save it as OG5_129801_proteins_msa.txt, open it in BioEdit and save it as a FASTA alignment file.
6.2. Repeat steps 4 and 5.

7. [ Submit OG5_129801_proteins_msa.txt to http://www.phylogeny.fr/simple_phylogeny.cgi and evaluate the tree ]
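For those who prefer a script to hand-editing in WordPad, steps 2.2 and 2.3 could be sketched in Python roughly as follows. This is only an illustration, not part of the workshop material. It assumes that each FASTA header starts with a token of the form species|accession (for example hsap|ENSP00000303476); check your own download before relying on that. The function names parse_fasta, clean_records and to_fasta are made up for this example. The short ID is built from the species code plus the last characters of the accession, and the script stops if two sequences would end up with the same ID.

```python
import re

# Species codes suggested in step 2.2 (four vertebrates/model species plus worm as outgroup).
KEEP = ("hsap", "mmus", "ggal", "drer", "cele")


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean_records(records, keep=KEEP, max_len=9):
    """Keep only the selected species and give each sequence a short, clean ID.

    ASSUMPTION: the first word of each header looks like 'species|accession'.
    """
    cleaned, seen = [], set()
    for header, seq in records:
        parts = header.split()
        if not parts:
            continue
        species, _, accession = parts[0].partition("|")
        if species not in keep:
            continue
        # Drop atypical characters such as * | { } . from the accession.
        acc = re.sub(r"[^A-Za-z0-9]", "", accession)
        # Species code plus the tail of the accession, at most max_len characters.
        tail = max(max_len - len(species), 0)
        new_id = (species + (acc[-tail:] if tail else ""))[:max_len]
        if new_id in seen:
            raise ValueError("duplicate short ID: " + new_id)
        seen.add(new_id)
        cleaned.append((new_id, seq))
    return cleaned


def to_fasta(records):
    """Write '>ID' followed by the sequence on the next line (see step 3.3)."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)
```

To use it, read OGXXX_proteins.txt into a string, pass it through parse_fasta and clean_records, and write the output of to_fasta to a new plain .txt file that you then submit to MUSCLE. Keep a note of which short ID belongs to which original accession, since the shortened names are what will appear in the tree.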
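The manual cleaning in steps 3.2 and 3.4 can also be approximated with a few lines of Python. The sketch below is a simple gap-fraction filter, not the Gblocks algorithm and not what BioEdit does internally. It assumes the alignment is already loaded as a dictionary mapping each sequence ID to its aligned sequence, with '-' as the gap symbol. The function names and both thresholds (0.5) are arbitrary choices made for this illustration.

```python
def gap_fraction_per_column(seqs):
    """Return, for each alignment column, the fraction of sequences with a gap."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    return [sum(s[i] == "-" for s in seqs) / len(seqs) for i in range(length)]


def trim_alignment(alignment, max_gap_fraction=0.5):
    """Keep only columns whose gap fraction is at most max_gap_fraction.

    alignment: dict mapping sequence ID -> aligned sequence ('-' = gap).
    The threshold is a hypothetical default, not a recommended value.
    """
    ids = list(alignment)
    seqs = [alignment[name] for name in ids]
    fractions = gap_fraction_per_column(seqs)
    keep = [i for i, f in enumerate(fractions) if f <= max_gap_fraction]
    return {name: "".join(seq[i] for i in keep) for name, seq in zip(ids, seqs)}


def partial_sequences(alignment, min_coverage=0.5):
    """Return the IDs whose fraction of non-gap positions is below min_coverage."""
    flagged = []
    for name, seq in alignment.items():
        coverage = sum(c != "-" for c in seq) / len(seq) if seq else 0.0
        if coverage < min_coverage:
            flagged.append(name)
    return flagged
```

A reasonable order is to call partial_sequences first, remove the flagged entries from the dictionary, and only then call trim_alignment, so that fragmentary sequences do not cause otherwise well-aligned columns to be discarded. Always compare the result with what you see in BioEdit or JALVIEW; a fixed threshold is no substitute for inspecting the alignment.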