GE3M25: Bioinformatics, class 4
TCD, 03/12/2015
Karsten Hokamp, PhD
Genetics, Trinity College Dublin, The University of Dublin

GE3M25 Data Handling: Module Content
• Python Programming
• Bioinformatics
• ChIP-Seq analysis

Overview
• Multiple alignments
• Phylogenetic trees
• Examples and Exercises
http://bioinf.gen.tcd.ie/GE3M25/

Why multiple alignments?
Uses of a multiple sequence alignment (MSA):
• Comparative genomics
• Phylogenetic studies
• Hierarchical function annotation: homologs, domains, motifs
• Gene identification, validation
• Structure comparison, modelling
• Interaction networks
• RNA sequence, structure, function
• Human genetics, SNPs
• Therapeutics, drug design and discovery
[Figure labels: insertion domain, DBD, LBD, binding sites / mutations]

Multiple alignment of upstream sequence from the ‘eve’ gene across 12 Drosophila species
[Figure: alignment upstream of the eve gene, with conserved and non-conserved regions marked]

Example analysis
• Identify a sequence to align – Heat shock protein beta 8
• Find homologous sequences in different species – BLAST
• Retrieve the sequences
• Run multiple alignment
• Compare two alignment tools

Find gene in NCBI Gene
Scroll down to get the protein sequence.

Extract protein sequence

Run BLAST
BLAST allows users to enter a protein or DNA sequence and search a database for similar sequences.

BLAST results
Click ‘Taxonomy reports’ to access results by organism.

BLAST hits by organism
The top BLAST hit from a particular organism is usually the ortholog of the sequence that you entered into BLAST.

Select the accession numbers of several orthologs and add them to a file.
Paste the accession numbers into NCBI Protein to retrieve the protein sequences.

List of protein orthologs
Select ‘Display Settings’ to retrieve the proteins in FASTA format.
• FASTA: lists FASTA format with summary data
• FASTA (text): just the FASTA sequences, without the summary data

FASTA format
Copy the FASTA data to a file and save.

Rename FASTA sequences
The alignment program displays the first word after the ‘>’ symbol. Edit the FASTA header to include the species.

EMBL-EBI MSA
http://www.ebi.ac.uk/Tools/msa/
• Select the type of sequence
• Paste your FASTA sequences or upload a file
• Be notified by email for larger jobs

[Alignment output, highly conserved region marked]
30 identical matches, 28 similar matches, length = 243

Colour residues by their physicochemical properties.

Click ‘Results Summary’ to see the percent identity between each pair of sequences.

Alternative Approach: UniProt
1. Browse to uniprot.org
2. Search for HSPB8
3. Show up to 100 hits
4. Sort by gene name
5. Select 5-8 entries from different species for HSPB8
6. Include a fish, an insect, a bird and mammals (protein lengths should be similar)
7. Click on 'Align'
• Edit and resubmit to change headers or remove sequences
• Use the selection at the bottom of the page to remove sequences, e.g. fly
• Note in the resulting alignment: mismatch favoured over gap insertion

Exercise
Copy the sequences and run them in a different alignment tool: http://www.ebi.ac.uk/Tools/msa

Results Summary: Percent Identity Matrix
What is the percent amino acid sequence identity between human and cow?

EMBL-EBI MSA: effect of the substitution matrix
http://www.ebi.ac.uk/Tools/msa/

Matrix: No matrix
33 identical matches, 27 similar matches, length = 244

Matrix: PAM 350 matrix
33 identical matches, 27 similar matches, length = 258
A lot of gaps introduced!
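The counts of identical matches and the Percent Identity Matrix reported by the alignment tools can be approximated by hand. The following is a minimal Python sketch, not the tools' actual implementation. It assumes the aligned sequences are equal-length strings with '-' for gaps, counts columns where all sequences carry the same residue, and computes pairwise identity over columns where neither sequence has a gap (real tools may use a different denominator). The function names and the toy sequences are made up for illustration.

```python
from itertools import combinations

GAP = "-"

def identical_columns(aligned):
    """Count alignment columns in which every sequence has the same residue."""
    return sum(
        1
        for column in zip(*aligned.values())
        if GAP not in column and len(set(column)) == 1
    )

def percent_identity(a, b):
    """Percent identity of two aligned rows, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)

def identity_matrix(aligned):
    """Return {(name1, name2): percent identity} for all pairs of sequences."""
    names = list(aligned)
    matrix = {(n, n): 100.0 for n in names}
    for n1, n2 in combinations(names, 2):
        pid = percent_identity(aligned[n1], aligned[n2])
        matrix[(n1, n2)] = matrix[(n2, n1)] = pid
    return matrix

# Toy alignment (hypothetical sequences, not real HSPB8 data)
toy = {
    "human": "MADGQ-LPFS",
    "cow":   "MADGQMLPFS",
    "fish":  "MSDG--LPYS",
}

print("identical columns:", identical_columns(toy))
for (n1, n2), pid in sorted(identity_matrix(toy).items()):
    print(f"{n1:6s} {n2:6s} {pid:6.1f}")
```

With real data, the dictionary would be filled from the aligned FASTA output of the alignment tool instead of the toy sequences.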
Matrix: BLOSUM 62 matrix
34 identical matches, 29 similar matches, length = 248
Highest identity score

PAM and BLOSUM matrix (adapted from Baxevanis 2005)
Matrix                 Best in determining
PAM 40 / BLOSUM 90     Short, similar (conserved) alignments
PAM 250                Longer, more divergent alignments
PAM 160 / BLOSUM 80    Detecting members of protein families
BLOSUM 62              Finding all potential similarities

Exercise: Investigation of the 5-HT (Serotonin) receptors
The serotonin receptors, also known as 5-hydroxytryptamine receptors or 5-HT receptors, are a group of G protein-coupled receptors (GPCRs) and ligand-gated ion channels (LGICs) found in the central and peripheral nervous systems. Serotonin receptors are found in almost all animals, including humans, and are even known to regulate longevity and behavioral aging in the primitive nematode Caenorhabditis elegans.

Step 1 – collect sequences: http://uniprot.org, sort by Entry name
Step 2 – alignment
Step 3 – summary information

Align across species
Step 1 – collect sequences: search for 5HT1A and sort by Entry name
Step 2 – alignment
Step 3 – summary

Exercise: Pick one of 5HT1B, 5HT1D, 5HT1E, 5HT1F and check for conservation across species.

Exercise: Pick all entries starting with 5HT and check for conservation within and across species.

Fasta header reformatting
>sp|O42385|5H1AA_TAKRU 5-hydroxytryptamine receptor 1A-alpha OS=Takifugu rubripes GN=htr1aa PE=3 SV=1
MDLRATSSNDSNATSGYSDTAAVDWDEGENATGSGSLPDPELSYQIITSLFLGALILCSI
FGNSCVVAAIALERSLQNVANYLIGSLAVTDLMVSVLVLPMAALYQVLNKWTLGQDICDL
FIALDVLCCTSSILHLCAIALDRYWAITDPIDYVNKRTPRRAAVLISVTWLIGFSISIPP
MLGWRSAEDRANPDACIISQDPGYTIYSTFGAFYIPLILMLVLYGRIFKAARFRIRKTVK
KTEKAKASDMCLTLSPAVFHKRANGDAVSAEWKRGYKFKPSSPCANGAVRHGEEMESLEI
IEVNSNSKTHLPLPNTPQSSSHENINEKTTGTRRKIALARERKTVKTLGIIMGTFIFCWL
PFFIVALVLPFCAENCYMPEWLGAVINWLGYSNSLLNPIIYAYFNKDFQSAFKKILRCKF
HRH
• Loads of information in the header, but only the first bit shows up in the alignment
• Reformat the headers to show the organism

Achieved through a Perl one-liner (http://bioinf.gen.tcd.ie/pol):
perl -p -i -e 's/>.+\|(5H.+?)_.+OS=(.+?) (.+?)/>${1}_${2}_${3}/' uniprot_5HT.fasta

Result (sequence lines unchanged):
>5H1AA_Takifugu_rubripes GN=htr1aa PE=3 SV=1
The important bits are now at the start of the line and connected via '_'.

Phylogenetic Trees
- based on multiple sequence alignments
- show relation between sequences/species
[Figure: Darwin, On the Origin of Species]

Alignment to Tree

Phylogenetic Trees (figure sources)
Baum, D. (2008) Reading a phylogenetic tree: The meaning of monophyletic groups. Nature Education 1(1):190
Baum, D. (2008) Trait evolution on a phylogenetic tree: Relatedness, similarity, and the myth of evolutionary advancement. Nature Education 1(1):191

TreeDraw
http://webconnectron.appspot.com/Treedraw.html
Description of controls

Don't forget to log out!
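The FASTA header reformatting done with the Perl one-liner can also be sketched in Python, the language taught in this module. This is an illustrative equivalent under assumptions, not the course's own script. It assumes UniProt-style headers of the form '>db|accession|ENTRY_SPECIES description OS=Genus species ...'. The names `HEADER_RE`, `reformat_header` and `reformat_fasta` are hypothetical, and the regular expression differs from the Perl one: it accepts any entry name (not only those starting with 5H) and captures the full species word explicitly.

```python
import re

# Assumed header layout: >db|accession|ENTRY_SPECIES ... OS=Genus species ...
HEADER_RE = re.compile(r"^>[^|]*\|[^|]*\|([^_\s]+)_\S+.*?\bOS=(\S+) (\S+)")

def reformat_header(line):
    """Rewrite a UniProt header as >ENTRY_Genus_species, keeping the trailing fields."""
    if not line.startswith(">"):
        return line  # sequence lines pass through untouched
    # Headers that do not match the assumed layout are returned unchanged
    return HEADER_RE.sub(r">\1_\2_\3", line, count=1)

def reformat_fasta(lines):
    """Yield FASTA lines with every header rewritten."""
    for line in lines:
        yield reformat_header(line)

example = [
    ">sp|O42385|5H1AA_TAKRU 5-hydroxytryptamine receptor 1A-alpha "
    "OS=Takifugu rubripes GN=htr1aa PE=3 SV=1",
    "MDLRATSSNDSNATSGYSDTAAVDWDEGENATGSGSLPDPELSYQIITSLFLGALILCSI",
]

for line in reformat_fasta(example):
    print(line)
```

To process a whole file, the same generator can be fed an open file handle and its output written to a new file, which leaves the original untouched (unlike the in-place edit done by `perl -i`).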
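To make the step from alignment to tree more concrete, here is a small Python sketch of one simple distance-based approach: pairwise p-distances (the fraction of differing positions) followed by UPGMA clustering, with the result written in Newick format. This is only one of several tree-building methods and not necessarily what the tools shown in the slides use. The function names and the toy alignment are made up for illustration, and branch lengths are omitted.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions, skipping columns where either row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(aligned):
    """Cluster aligned sequences with UPGMA; return a Newick string (topology only)."""
    sizes = {name: 1 for name in aligned}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(aligned[a], aligned[b])
        for a, b in combinations(aligned, 2)
    }
    while len(sizes) > 1:
        # Merge the closest pair; ties are broken by label for reproducibility
        pair = min(dist, key=lambda p: (dist[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"
        na, nb = sizes.pop(a), sizes.pop(b)
        del dist[pair]
        for other in sizes:
            da = dist.pop(frozenset((a, other)))
            db = dist.pop(frozenset((b, other)))
            # UPGMA: size-weighted average of the distances to the two merged clusters
            dist[frozenset((merged, other))] = (na * da + nb * db) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes)) + ";"

# Toy alignment (hypothetical sequences, no gaps)
toy = {
    "human": "MADGQLLPFS",
    "cow":   "MADGQMLPFS",
    "fish":  "MSDGQLLPYS",
    "fly":   "MTEGKLIPWT",
}

print(upgma(toy))
```

For the toy data the two most similar sequences (human and cow) are joined first, then fish, then fly. The resulting Newick string can be pasted into a tree viewer.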