Biochemistry II: Introduction to Computational Tools Applied to Biology
Daniel Abegg (abegg6@etu.unige.ch)
Assistants: Thomas Falguières, Francine Dreier, Marie-Claude Blatter, Olivier Schaad, Thierry Soldati
Room Baud-Bovy BB03
25 February 2009

1 Introduction to biological databases

Look for specific databases

– Try to find a database (its corresponding home server address and the date of the latest update) dealing with:

Topic                                   Address                                             Database        Last update
Dictyostelium discoideum                http://dictybase.org/                               dictyBase       ???
Drosophila                              http://flybase.org/                                 FlyBase         23 Jan 2009
Cotton                                  http://cottondb.org/                                CottonDB        31 Jul 2008
Restriction enzymes                     http://rebase.neb.com                               REBASE          ???
Gene Ontology                           http://www.geneontology.org                         Gene Ontology   19 Mar 2008
Transcriptomic data (microarray data)   http://www.ebi.ac.uk/microarray-as/ae/              EMBL-EBI        May 2008
Human genes and genetic disorders       http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim    OMIM            daily
Lipids                                  http://www.lipidmaps.org/                           Lipid Maps      18 Jun 2008

Searching for sequences (1)

– Enter "ken and barbie" in the text search box of the UniProt website. In which species do you find a sequence for this gene? Does it mean that this gene exists only in this species?
This gene sequence is found in Drosophila melanogaster, but that does not mean it exists only in this species.

– Have a look at the Drosophila melanogaster UniProtKB entry for this gene. Find the protein, the RNA and the corresponding genomic sequences: list their accession numbers (ACs) for each of these sequence categories.

Sequence      AC         Database
mRNA          AJ012576   EMBL
genomic DNA   AB010261   EMBL
protein       O77459     UniProt

– Could you get information about a person to contact in order to ask for an already cloned cDNA?
Mark Stapleton et al. (staple@fruitfly.org) published an article: "A Drosophila full-length cDNA resource".

– Do the same search at NCBI. In which species do you find a sequence for this gene?
The "ken and barbie" gene search at NCBI gave the following species: Apis mellifera and Acyrthosiphon pisum.
– Find a protein, an RNA and a corresponding genomic sequence for Drosophila melanogaster 'ken and barbie': list the accession numbers (ACs) for each of these sequence categories.

Sequence      AC          Database
mRNA          NM_079109   NCBI
genomic DNA   NT_033778   NCBI
protein       NP_523833   NCBI

– Look for the AM948965 sequence at NCBI. What does this accession number (AC) correspond to? Find the corresponding publication. Display the sequence in different formats (Fasta, GenBank format...).
This accession number corresponds to the Homo sapiens neanderthalensis complete mitochondrial genome. The publication is "A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing" (PMID: 18692465). The beginning of the sequence in two different formats:

FASTA
>gi|195972535|emb|AM948965.1| Homo sapiens neanderthalensis complete mitochondrial genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG

GenBank
LOCUS        AM948965   16565 bp   DNA   circular   PRI   20-AUG-2008
DEFINITION   Homo sapiens neanderthalensis complete mitochondrial genome.

Searching for sequences (2)

– Look for Mammoth, Dodo and Tyrannosaurus protein sequences.

Animal          Number of proteins
Mammoth         95
Dodo            14
Tyrannosaurus   3

– Look for the complete genomic sequence of E. coli strain K12 at NCBI: how many genes are there? Display the sequence in Fasta format.
There are 4444 genes in E. coli strain K12.
>gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W3110 DNA, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

Genome overview

– Go to Map Viewer at NCBI. How many chimp chromosomes do you see?
The chimp has 25 chromosomes: 24 (with 2 different copies of chromosome 2) + 1.

– Look for data available for human chromosome X: how many genes do you see?
The human chromosome X has 1529 genes. The AC of the 5' telomeric sequence is NT_086925. The repeat in this sequence starts at about nucleotide position 16000.
Genomic databases: follow the links

– Look for the EcoGene database. What type of data, species do you find?
Database of Escherichia coli Sequence and Function.

– Look for the gene gutQ. Find its chromosomal location.
Left End: 2827835, Right End: 2828800 (clockwise); Minute or Centisome (%) = 60.95

– Find the next gene on the same strand.
The next gene is norV.

– Follow the link to UniProtKB/Swiss-Prot. Find the subcellular location of the protein.
This protein seems to be located in the cytoplasm.

Query UniProtKB

– Find all the nuclear proteins of Dictyostelium discoideum. How many are there? Do you think you get a complete set? Why? What are the shortest and largest known protein sequences?
There are 427 nuclear proteins of Dictyostelium, but there could be more, so this is not a complete set. The longest protein (Midasin) is 5900 amino acids long and the shortest is the DNA-directed RNA polymerases I, II, and III subunit rpabc4, with 46 amino acids.

3D structure

– Look for the data on human insulin in PDB (PDB accession number: 1A7F). What are the 'experimental data' stored in the PDB database?
The experimental data are NMR spectra.

– To obtain the 3D structure, click on 'Quick pdb' for example.
Fig. 1 – 3D structure of human insulin (PDB entry 1A7F)

Metabolic database (KEGG)

– Look for data on glycolysis in KEGG. Find the name of the enzyme which catalyzes the conversion of fructose 1,6 P2 into glyceraldehyde 3P.
The name of the enzyme is ALDO (entry: K01623).

– Compare with the same pathway in sea urchin. Does the enzyme also exist in this species?
The enzyme also exists in sea urchin (entry: 548623).

Polymorphism database (dbSNP)

– Look for information in the dbSNP database on the human blue eye variant rs12913832 (A = ancestral brown allele, G = blue allele). In which gene do we find this polymorphism? Find the corresponding publication/citation.
This polymorphism is found in the HERC2 gene.
The corresponding publication is "Blue eye color in humans may be caused by a perfectly associated founder mutation in a regulatory element located within the HERC2 gene inhibiting OCA2 expression." by Eiberg H et al. (PMID: 18172690).

– What is Craig Venter's 'eye color' (look at the Celera genome assembly (= Craig Venter) sequence)? Follow the link to the Alfred database to look for the population distribution of the 'blue eye allele' (Google map). In which part of Europe is the blue allele the least prevalent? Can you propose a hypothesis for this geographical distribution?
Craig Venter has the G allele, which means that he has blue eyes. On the Google map, Spain is the part of Europe where the blue allele is the least prevalent. This geographical distribution could be explained by a mutation that arose before the migration and was kept in the northern regions, where there is less sun, because it was not a disadvantage there.

2 Protein sequence analysis

Primary sequence analysis

– Find the physico-chemical parameters of the protein sequences (Seq 3 and Seq 4) (use ProtParam, 'MW, pI, Titration curve' and SAPS). Look in particular for the number of amino acids, the molecular weight (MW), the pI, the molar extinction coefficient, the total number of atoms and the chemical formula for each protein.

       Number of a.a.   MW (Da)    pI     Ext. coeff. at 280 nm (M-1 cm-1)   Total atoms   Formula
seq3   988              109770.1   9.16   52745                              15426         C4811 H7715 N1413 O1448 S39
seq4   1127             126363.8   6.68   122185                             17823         C5680 H8942 N1498 O1647 S56

Values found with ProtParam and compared with 'MW, pI, Titration curve' and SAPS.

Topology - Transmembrane prediction

– Can you predict the subcellular location of the protein (use PSORT)? Can you predict the position of a possible signal peptide (SignalP)? Can you predict the position of possible transmembrane segment(s)? Compare HMMTOP, TMHMM and TMpred results (pay attention to the required sequence format!).
       PSORT                 SignalP   HMMTOP   TMHMM   TMpred
seq3   nuclear (94.1%)       No        No       No      (Yes)
seq4   cytoplasmic (94.1%)   No        Yes      Yes     Yes

For sequence 3, TMpred predicted a transmembrane segment, but this is unlikely for a protein predicted to be nuclear.

Fig. 2 – Possible transmembrane domains in sequence 4 proposed by the TMHMM tool

Post-translational modification (PTM) prediction

– Take your favorite protein sequence (Seq 3 and Seq 4). First look at the biological information available for each type of PTM. Compare the results obtained with different phosphorylation prediction tools (NetPhos and NetPhosK). Compare the results obtained with different myristoylation prediction tools. Compare the results obtained with different glycosylation prediction tools (YinOYang, NetNGlyc). What conclusion can you draw about the presence of these PTMs in your sequence?

       NetPhos (position)         NetPhosK (position)   Myristoylator   NMT
seq3   Ser: 41, Thr: 7, Tyr: 7    PKA: 873              No              No
seq4   Ser: 31, Thr: 16, Tyr: 8   PKC: 367              No              No

NetPhosK predicts a PKC site in sequence 4 at position 367. TMpred places this position in a transmembrane domain, where phosphorylation would not be possible. With TMHMM (figure above), position 367 is in the cytosol and is therefore a possible PKC site.

Fig. 3 – O-GlcNAc sites in sequence 3 predicted by YinOYang.
Fig. 4 – O-GlcNAc sites in sequence 4 predicted by YinOYang.

NetNGlyc predicted N-glycosylation sites for sequence 3, but a nuclear protein cannot carry this modification. Three sites are particularly probable for N-glycosylation in sequence 4: position 317 (76%), position 360 (75%) and position 530 (61%). The post-translational modifications are consistent with the data found earlier.

BLAST

– Look for one of the protein sequences (Seq 3 and Seq 4). Perform a BLAST search with BLAST @NCBI and BLAST @ExPASy. Note that the first hit may correspond to the same sequence stored in different sequence databases (UniProtKB and RefSeq). To which protein family does your favorite protein sequence belong?
Look at the data available for the best hit with BLAST @ExPASy and compare the annotation of the corresponding entry with the prediction results you got in the previous exercises.

Sequence 3 is a hypothetical protein with the AC AAK18922 (NCBI). The protein is from C. elegans: 100% at NCBI and 73% at ExPASy, where the protein is uncharacterized. This protein could be involved in post-transcriptional gene expression processes involving mRNA and rRNA (information taken from NCBI). The UniProt entry (O01864) for this protein gives nucleic acid binding or nucleotide binding as molecular function. These functions are consistent with the nuclear location of the protein.

Sequence 4 is also a hypothetical protein; its AC is NP_001023542 (NCBI). It is also from C. elegans: 100% at NCBI and 96% at ExPASy, where the protein is uncharacterized. The annotated regions indicate that the protein could be a cation transport ATPase and an E1-E2 ATPase (NCBI). The UniProt entry (Q9N323) for this protein proposes hydrolase as molecular function. The protein is also annotated as a membrane protein with a transmembrane domain (UniProt), which agrees with the previous results.

From sequencing to biological information

– Read the following sequencing gel. What is the function of the corresponding gene product?
Fig. 5 – Sequence: cagaagaggccatcaagcacatcactgtccttctgccatggccc
NCBI BLAST identifies it as the insulin mRNA from Homo sapiens. Insulin is a hormone secreted when the blood glucose concentration is high; it therefore activates glucose uptake by the liver.

BLAST specificity

– Take a random DNA sequence, for example: attatacgtatataattccgataatcgcgctga. Using BLAST @NCBI, try to find it in the human genome.
This sequence cannot be found in the human genome. The best hit covers only about half of the random sequence.
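The absence of a full-length hit for the random 33-mer can be made plausible with a rough estimate of how often a sequence of a given length is expected to occur by chance. The sketch below is only an order-of-magnitude illustration: it assumes equally frequent, independent bases and a human genome size of about 3.1e9 bp, and the function name expected_random_matches is chosen here for illustration.

```python
def expected_random_matches(query_length, genome_size, strands=2):
    """Expected number of exact chance occurrences of one fixed sequence.

    Assumes a genome of independent, equally frequent bases (a simplification).
    """
    positions = genome_size - query_length + 1
    return strands * positions / 4 ** query_length


QUERY = "attatacgtatataattccgataatcgcgctga"  # random sequence from the exercise
HUMAN_GENOME_BP = 3.1e9                      # assumed approximate genome size

if __name__ == "__main__":
    full = expected_random_matches(len(QUERY), HUMAN_GENOME_BP)
    half = expected_random_matches(16, HUMAN_GENOME_BP)
    print(f"full {len(QUERY)}-mer: {full:.1e} expected chance matches")
    print(f"16-mer (about half): {half:.2f} expected chance matches")
```

Under these assumptions a full 33-nucleotide match is essentially never expected by chance, whereas a match of about 16 nucleotides is expected roughly once, which is in line with a best hit that covers only about half of the query.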
– Perform a BLAST search with a fragment of the insulin gene: ctgggcgggg gccctggtgc aggcagcctg. Repeat the exercise using a mutated insulin sequence: ctgggcgggg gccctggtgc aggcagcatg.
Insulin is still found by NCBI BLAST with the mutated sequence.

– Have a look at the mammoth genome project. Does the gene 'ken and barbie' exist in mammoth?
The "ken and barbie" gene was not found in the mammoth genome project database.

Summary exercise

– Using the following human protein sequence, do the most complete primary sequence analysis possible, including the subcellular location and PTM prediction (try to stay as close as possible to the biological interest in the order of the analysis).
NCBI BLAST found the corresponding protein, which is fibronectin. This protein has a molecular weight of 262606.5 Da, a pI of 5.45 and the formula C11486 H17822 N3206 O3681 S90 (ProtParam). HMMTOP and TMHMM find no transmembrane domains, but TMpred finds some, which is not consistent for fibronectin because this protein is secreted and present in the extracellular space and extracellular matrix (UniProt: P02751). Signal sequences were found with SignalP. NetPhos predicted phosphorylation sites at serine 79, threonine 57 and tyrosine 25. NetPhosK found a PKC site at position 29, and there is no myristoylation (Myristoylator and NMT). Fibronectin seems to have some N-glycosylation sites (NetNGlyc), such as at positions 430 (76%), 542 (72%) and 1244 (71%), and many O-GlcNAc sites (YinOYang), as shown in the picture below.

Fig. 6 – O-GlcNAc sites for fibronectin (YinOYang)

3 Phylogenetic analysis

Start playing with...

– ...Philophylo. Compare some of the trees obtained depending on the input sequences and/or the number of input sequences. Which protein (of those provided by this dataset) has been the most 'conserved' during the course of evolution? Do you have an idea why?
Histone H4 is the most conserved because of its important function in compacting DNA.
– Which protein is the most 'universal' (= present in most of the species)?
The most universal protein is cytochrome b.

Compare protein sequence by multiple alignment

– Here are the sequences of 5 orthologous genes (i.e. the same gene in 5 different species): ARP2 A, ARP2 B, ARP2 C, ARP2 D, ARP2 E. Do a multiple alignment by using one of the alignment tools available on ExPASy. Compare the results obtained by the different tools.

T-COFFEE output
CLUSTAL FORMAT for T-COFFEE Version_5.05 [http://www.tcoffee.org], SCORE=78, Nseq=5, Len=60
ARP2_B    MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_E    MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_C    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_D    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_A    MES---APIVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
          *:*     :* ******** ***  *** . **::****::*:  : *::::**:***:*

CLUSTALW 2.0.10 multiple sequence alignment
ARP2_C    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE 60
ARP2_D    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE 60
ARP2_B    MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE 60
ARP2_E    MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE 60
ARP2_A    MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE 57
          *:*     :* ******** ***  *** . **::****::*:  : *::::**:***:*

Muscle
>ARP2_A
MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
>ARP2_E
MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
>ARP2_C
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
>ARP2_D
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
>ARP2_B
MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE

In conclusion, the three tools place the single gap in the same region of ARP2_A (after MES with T-COFFEE, after MESAP with CLUSTALW and Muscle); the rest of the alignment is identical.
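The pairwise differences needed for a distance matrix can be counted column by column on such an alignment. The following is a minimal sketch, not the method prescribed by the exercise: it uses the 60 aligned columns of the CLUSTALW output above, counts every gap column as one difference (an assumption), and the names count_differences and distance_matrix are illustrative.

```python
from itertools import combinations

# The 60 aligned columns, copied from the CLUSTALW output above.
ALIGNMENT = {
    "A": "MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE",
    "B": "MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE",
    "C": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "D": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "E": "MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE",
}


def count_differences(seq1, seq2):
    """Number of alignment columns where the two sequences differ.

    A gap facing a residue counts as one difference (assumption).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must come from the same alignment")
    return sum(a != b for a, b in zip(seq1, seq2))


def distance_matrix(alignment):
    """Upper-triangle distances, keyed by pairs of sequence names."""
    return {
        (x, y): count_differences(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


if __name__ == "__main__":
    for (x, y), d in distance_matrix(ALIGNMENT).items():
        print(f"{x}-{y}: {d}")
```

Counts that involve sequence A depend on how gap columns are scored and on whether each pair is re-aligned separately, so they can differ slightly from a matrix filled in by hand.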
Manual phylogenetic analysis

– Look at the multiple sequence alignment obtained above. Fill in the following 'distance matrix' by counting the differences between the sequences (if necessary, redo an alignment with the sequences 2 by 2).

     A    B    C    D    E
A    –    24   24   24   25
B    –    –    10   10   15
C    –    –    –    0    13
D    –    –    –    –    13
E    –    –    –    –    –

– Knowing that the species are Caenorhabditis briggsae, Drosophila melanogaster, Homo sapiens, Mus musculus and Schizosaccharomyces pombe, which sequence is likely to correspond to which species?
A = Schizosaccharomyces pombe
B = Caenorhabditis briggsae
C = Mus musculus
D = Homo sapiens
E = Drosophila melanogaster

– Try to draw a phylogenetic tree.
Fig. 7 – Phylogenetic tree of an orthologous gene sequence of the species Schizosaccharomyces pombe, Caenorhabditis briggsae, Mus musculus, Homo sapiens and Drosophila melanogaster

Phylogenetic analysis

– Get the ARP2 protein sequences from human, mouse, fruit fly, worm and fission yeast from UniProtKB in Fasta format.
The sequences are those of the previous exercise.
Homo sapiens (Human)
>sp|P61160|ARP2_HUMAN Actin-related protein 2 OS=Homo sapiens GN=ACTR2 PE=1 SV=1
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
LADIMKDKDNFWMTRQEYQEKGVRVLEKLGVTVR

Mus musculus (Mouse)
>sp|P61161|ARP2_MOUSE Actin-related protein 2 OS=Mus musculus GN=Actr2 PE=1 SV=1
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
LADIMKDKDNFWMTRQEYQEKGVRVLEKLGVTVR

Drosophila melanogaster (Fruit fly)
>sp|P45888|ARP2_DROME Actin-related protein 2 OS=Drosophila melanogaster GN=Arp14D PE=2 SV=2
MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR
EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR
RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL
VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH
IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV
LAEVTKDRDGFWMSKQEYQEQGLKVLQKLQKISH

Caenorhabditis briggsae
>sp|Q61JZ2|ARP2_CAEBR Actin-related protein 2 OS=Caenorhabditis briggsae GN=arx-2 PE=3 SV=1
MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
CSQLRQMLDINYPMDNGIVRNWDDMGHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR
EKMFQVMFEQYGFNSIYVAAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTRRL
DIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVLSQ
QYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKHIV
LSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAVLA
NLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA

Schizosaccharomyces pombe (Fission yeast)
>sp|Q9UUJ1|ARP2_SCHPO Actin-related protein 2 OS=Schizosaccharomyces pombe GN=arp2 PE=1 SV=1
MESAPIVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDEAEA
VRSLLQVKYPMENGIIRDFEEMNQLWDYTFFEKLKIDPRGRKILLTEPPMNPVANREKMC
ETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVGRLDV
AGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVLMRNY
TLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRAIVLS
GGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAVLADI
MAQNDHMWVSKAEWEEYGVRALDKLGPRTT

– Reconstruct phylogenetic trees using the 'One click' analysis methods provided at http://www.phylogeny.fr/. In case of server problems, use alternative servers for phylogenetic analysis.
Fig. 8 – Phylogenetic tree of the ARP2 protein sequences from human, mouse, fruit fly, worm and fission yeast, built with the http://www.phylogeny.fr/ 'One click' method.

Phylogenetic analysis

– How many distinct trees do you have on this figure?
Fig. 9 – All the trees are the same. There is only one tree.

– List the positions on the following trees where there is
- a gene duplication event: 2 and 10
- a speciation event: 1, 3, 4, 5, 6, 7, 8, 9, 11 and 12
Fig. 10 – Numbers 2 and 10 are duplications and the other ones are speciations.

The Tree of Life

– Construct a phylogenetic tree based on dataset 4 using the 'one click' method at http://www.phylogeny.fr/. To get the correspondence between the 5-letter codes (e.g. ARATH, BACSU) and the species, query the UniProt website or look at the document Controlled vocabulary of species@UniProt.

– Explain the tree. Locate gene duplication and/or speciation events. Does the resulting tree correspond to the species tree (3 kingdoms: Eukaryota, Archaea, Bacteria)?
Fig. 11 – The big separation, which is indicated as a duplication, is the only duplication. The other events are speciations.
All three kingdoms are present on the tree.

– Try to explain the position of EFTU_ARATH in the tree.
EFTU_ARATH is the chloroplast elongation factor Tu of Arabidopsis thaliana. Chloroplasts are of bacterial origin, which explains its position in the tree.

Exercise 6

– If you are still alive, construct a tree with your favorite protein (e.g. insulin)...
Fig. 12 – Phylogenetic tree for the ATP synthase subunit a. The sequences were found on UniProt. Salmo salar is a fish and Branchiostoma floridae is a lancelet. Sus scrofa is the wild boar. Anopheles gambiae and Aedes aegypti are insects and Metridium senile is a sea anemone. All the separations are speciations.

4 Introduction to gene prediction

Non-protein-coding RNA (ncRNA) gene prediction

– In a C. elegans genomic sequence (cosmid), look for the presence of tRNA gene(s) with tRNAscan-SE. Use the default 'search mode' and the source 'Eukaryotic'. Have a look at the tRNA structure.

Sequence   tRNA   tRNA bounds     tRNA   Anti    Intron bounds   Cove
name       #      Begin   End     type   codon   Begin   End     score
Cosmid     1      169     238     His    ATG     0       0       20.56

Fig. 13 – Image of the tRNA of the given sequence
There is one tRNA.

'Ab initio' protein-coding gene prediction

– Get gene 1, a genomic sequence from C. elegans, and compare the results of gene predictions obtained by different programs (pay attention to the format of the submitted sequence): HMMgene, NetGene2, WebGene (Genebuilder) (option: "First and last coding exons: disabled"). Draw a schema describing the different predicted gene structures with the positions (numbering) of the exon and intron boundaries.
Fig. 14 – Exon and intron boundaries found with the different tools on the C. elegans gene, and the mRNA (EST) of the same gene (Blastn from NCBI).

– Compare the results obtained by HMMgene if you choose 'human' instead of C. elegans as organism. Why are they different?
The boundaries of the exons found are the same as with C. elegans, but with 'human' there are three more on the complementary strand: 1290 to 1418, 1461 to 1650 and 1443 to 2522.

Protein-coding gene prediction and the use of sequenced mRNAs (ESTs)

– Do a Blastn search at NCBI with the genomic sequence (gene 1). Select C. elegans ESTs (mRNAs). How many different "RNA" sequence(s) can you retrieve?
There are 3 different RNAs, with mainly 4 exons.

– Retrieve the sequence of the mRNA (EST) BJ818152 in Fasta format. Align this EST with the genomic sequence by using SIM4 (an alignment tool specific for cDNA and genomic sequence alignment; SIM4 takes care of the intron/exon boundaries). Compare the intron/exon boundary numbering with the results obtained by the prediction programs (previous exercise).

>BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis elegans cDNA clone yk1685h11 3', mRNA sequence.
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATT
CTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGC
CTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTG
TTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATAGCATCAAGGGAAAGTCC
AGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTC
TTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATCTGATGTCACG
TGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT
TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAAC
GCAGGTTTCGACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAAT
TAAACCTACAAATAAAAATGAGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAA
AACCGAAAACGAGAAAATTATTCTATTATGACAGATAGAATAAGTTAAAATGGGAAGAGT
GCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCGTGGGCAAGGTAAGCGACATT
GTTCGATGAA

The EST schema is on the previous figure.

– Do the same job with another EST (with different exon/intron boundaries if they exist).
The BJ775052 EST was chosen and the boundaries are about the same: from 969 to 1406, from 1452 to 1661 and from 1914 to 2019.
Translation

– Translate the EST BJ818152 sequence by using one of the tools provided at ExPASy (pay attention to the EST sequence orientation!). Select several potential ORFs (open reading frames). Using BlastP (@ExPASy), compare each potential ORF with already known C. elegans protein sequences. Identify the correct protein sequence. Try to find the function of the protein.
This is the selected ORF from the translation of the EST BJ818152 (3'5' frame, because the EST is a 3' read):
M A R Q T K L K Q Q V K K R Q E G T E K T A K Q T C K K A A V L S A K Y R V K N S R Q I V G N V A K Y P V K T K R N D A I D R A A H P G H G K R L V R T D G K V Q I F L S G K I R W T V L Y R I K N K K G T H G Q E Q V T A V A G L S L D A I L A K R N Q T E D F R R N K A V R A A K A A A N K E K K A S Q P K P R V G G K R

Fig. 15 – BlastP result for the above ORF.
The protein found is the 60S ribosomal protein L24 from Caenorhabditis elegans. Its molecular function is structural constituent of ribosome (UniProt entry: O01868).

– For fun: translate the genomic sequence (gene 1) directly and try to find the correct protein sequence.
The protein cannot be found by translating the genomic sequence directly, because the coding sequence is interrupted by introns.

If you are still alive...

– ...try to find the correct protein sequence encoded by the following genomic sequence from C. elegans (gene 2).
The protein corresponding to gene 2 is searched for as follows. First, the exons of gene 2 are located on the complementary strand at the following positions: 789 to 1111, 1410 to 1636 and 1688 to 1845 (HMMgene). A Blastn search is then done with gene 2 (NCBI) and a sequence with similar exons is chosen.
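Reading an ORF in the 3'5' direction, as in the translation exercise above, amounts to translating the reverse complement of the submitted sequence. The sketch below illustrates this with the standard genetic code; it is not the ExPASy or NCBI tool itself. The fragment EST_START is the first 135 nucleotides of the OSTR075F6_1 cDNA used for gene 2, and the function names are chosen here for illustration.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# First 135 nucleotides of the OSTR075F6_1 cDNA (gene 2 exercise).
EST_START = (
    "AATTTGCCCGGGTTCCTTCTTCAACGGATCCTCTTCCTCGTCCTTAACTCTTCTGATCTT"
    "CTCCTGTTTTCGATACTTCGCCCGCCGATTCTGAAACCACACTTGAACTCGGGCTTCAGT"
    "TAAATCAATTCTCAT"
)

if __name__ == "__main__":
    # Frame 0 of the reverse strand starts at the ATG of the ORF.
    print(translate(reverse_complement(EST_START)))
```

For a full EST the frame is not known in advance, so in practice all three frames of the reverse strand (frame=0, 1, 2) would be translated and the longest ORF starting with M kept.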
Chosen sequence:
>OSTR075F6_1 AD-wrmcDNA Caenorhabditis elegans cDNA, mRNA sequence
AATTTGCCCGGGTTCCTTCTTCAACGGATCCTCTTCCTCGTCCTTAACTCTTCTGATCTT
CTCCTGTTTTCGATACTTCGCCCGCCGATTCTGAAACCACACTTGAACTCGGGCTTCAGT
TAAATCAATTCTCATTGCAATTTCTTCTCGTGTATAAATATCTGGATAATGAGTTTCACA
GAATGATCTTTCCAACTCCTTCAGTTGTCCTGATGTGAATGTGGTACGGATTCGGCGTTG
TTTTCGACGCTCGGCAGGGTTCAAAGGAGCTCCACCGGTTGAGCAGAGAGCACCAACAAG
AGAACTTCTTGGCAGTCCGTTCAAAACATTGCTACTTGTCCTCTGAATCGTATCACTTCC
AATTAATTGTGATTTTTGATACAACTGATATTGTAGACCAGTATTAAAAAAAGCTTGTAG
TGAATCGTGTGTTGTATTACGGTAGTTTGATGAGGAAGATGATGAAGATGTGGAATTGCC
CGCTGAAGAGCTTGAAGTATTGTGAGCAGTTGTCAAGGCACGTCCACTTTGT

After that, this sequence is translated (NCBI translation tool) and an ORF is chosen. The chosen ORF is read in the 3'5' direction because we are searching on the complementary strand:
M R I D L T E A R V Q V W F Q N R R A K Y R K Q E K I R R V K D E E E D P L K K E P G Q I

Finally a BlastP is done. The protein is the homeobox protein unc-4. It is a transcription factor (UniProt entry P29506), which could explain why so little mRNA was found in the Blastn search.

"I certify that in this text every statement that is not the product of my own reflection is attributed to its source, and that every passage copied from another source is in addition placed in quotation marks."
Daniel Abegg