Biochemistry II

—
Introduction to Computational Tools
Applied to Biology
Daniel Abegg
abegg6@etu.unige.ch
Assistants :
Thomas Falguières — Francine Dreier
Marie-Claude Blatter — Olivier Schaad — Thierry Soldati
Room Baud-Bovy BB03
25 February 2009
1 Introduction to biological databases
Look for specific databases
– Try to find a database (its corresponding home server address and the date of the
latest update) dealing with:

Topic                               Database        Home server address                               Latest update
Dictyostelium discoideum            dictyBase       http://dictybase.org/                             ???
Drosophila                          FlyBase         http://flybase.org/                               23 Jan 2009
Cotton                              CottonDB        http://cottondb.org/                              31 Jul 2008
Restriction enzymes                 REBASE          http://rebase.neb.com                             ???
Gene Ontology                       Gene Ontology   http://www.geneontology.org                       19 Mar 2008
Transcriptomic data (microarray)    EMBL-EBI        http://www.ebi.ac.uk/microarray-as/ae/            May 2008
Human genes and genetic disorders   OMIM            http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim  daily
Lipids                              Lipid Maps      http://www.lipidmaps.org/                         18 Jun 2008
Searching for sequences (1)....
– Enter "ken and barbie" in the text search box of the UniProt website.
In which species do you find a sequence for this gene?
Does it mean that this gene exists only in this species?
A sequence for this gene is found in Drosophila melanogaster, but this does not mean that
the gene exists only in this species.
– Have a look at the Drosophila melanogaster UniProtKB entry for this gene.
Find the protein, the RNA and corresponding genomic sequences : list their accession
numbers (ACs) for each of these sequence categories.
Sequence      AC         Database
mRNA          AJ012576   EMBL
genomic DNA   AB010261   EMBL
protein       O77459     UniProt
– Could you get information about a person to contact in order to ask for an already
cloned cDNA ?
Mark Stapleton et al. (staple@fruitfly.org) published an article: "A Drosophila full-length cDNA resource".
– Do the same search at NCBI
In which species do you find a sequence for this gene ?
The "ken and barbie" gene search at NCBI returned the following species: Apis mellifera
and Acyrthosiphon pisum.
– Find a protein, a RNA and a corresponding genomic sequences for Drosophila melanogaster
’ken and barbie’ : list the accession numbers (ACs) for each of these sequence categories.
Sequence      AC          Database
mRNA          NM_079109   NCBI
genomic DNA   NT_033778   NCBI
protein       NP_523833   NCBI
– Look for AM948965 sequence at NCBI
What does this accession number (AC) correspond to ?
Find the corresponding publication.
Display the sequence in different formats (Fasta, GenBank format...)
This accession number corresponds to the Homo sapiens neanderthalensis complete
mitochondrial genome. The publication is: "A complete Neandertal mitochondrial
genome sequence determined by high-throughput sequencing" (PMID: 18692465).
The beginning of the sequence in two different formats:
FASTA
>gi|195972535|emb|AM948965.1| Homo sapiens neanderthalensis complete mitochondrial genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG
GenBank
LOCUS       AM948965               16565 bp    DNA     circular PRI 20-AUG-2008
DEFINITION  Homo sapiens neanderthalensis complete mitochondrial genome.
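The FASTA record shown above can also be read programmatically. The following is a minimal sketch, assuming the older NCBI header layout `gi|number|database|accession.version|` seen in this report; the function names are illustrative and are not part of any NCBI tool.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession_from_header(header):
    """Extract the accession (without version) from a 'gi|..|db|ACC.v|' header."""
    fields = header.split("|")
    # Assumed layout: ['gi', '195972535', 'emb', 'AM948965.1', ' description']
    return fields[3].split(".")[0] if len(fields) > 3 else None


example = """>gi|195972535|emb|AM948965.1| Homo sapiens neanderthalensis complete mitochondrial genome
GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGG"""

header, seq = parse_fasta(example)[0]
print(accession_from_header(header), len(seq))
```

Only the first sequence line of the entry is used here, so the reported length is that of the excerpt, not of the full 16565 bp genome.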
Searching for sequences (2)....
– Look for Mammoth, Dodo and Tyrannosaurus protein sequences.
Animal          Number of proteins
Mammoth         95
Dodo            14
Tyrannosaurus   3
– Look for the complete genomic sequence of E.coli strain K12 at NCBI : how many
genes are there ?
Display the sequence in Fasta format.
There are 4444 genes in E. coli strain K12.
>gi|85674274|dbj|AP009048.1| Escherichia coli str. K12 substr. W3110 DNA, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
Genome overview
– Go to Map Viewer at NCBI. How many chimp chromosomes do you see?
The chimp has 25 chromosomes: 24 (with 2 different copies of chromosome 2) + 1.
– Look for data available for human chromosome X: how many genes do you see?
Human chromosome X has 1529 genes. The AC of the 5’ telomeric sequence is
NT_086925. The repeat in this sequence starts at about nucleotide position 16000.
Genomic databases : follow the links
– Look for the EcoGene database. What type of data, species do you find ?
Database of Escherichia coli Sequence and Function
– Look for the gene gutQ.
Find its chromosomal location.
Left End: 2827835 ----------------- Clockwise ----------------- Right End: 2828800
Minute or Centisome (%) = 60.95
– Find the next gene on the same strand.
The next gene is norV
– Follow the link to UniProtKB/Swiss-Prot.
Find the subcellular location of the protein.
This protein seems to be located in the cytoplasm.
Query UniProtKB
– Find all the nuclear proteins of Dictyostelium discoideum. How many are there ?
Do you think you get a complete set ? Why ?
What are the shortest and largest known protein sequences ?
There are 427 nuclear proteins of Dictyostelium, but there could be more, presumably
because the annotation of subcellular location is incomplete, so this is not a complete set.
The longest protein (Midasin) is 5900 amino acids long and the shortest is a DNA-directed RNA polymerases I, II, and III subunit rpabc4 with 46 a.a.
3D structure
– Look for the data on human insulin in PDB (PDB accession number : 1A7F)
What are the ’experimental data’ stored in the PDB database ?
The experimental data are NMR spectra.
– To obtain the 3D structure, click on ’Quick pdb’ for example.
Fig. 1 – 3D structure of human insulin (pdb entry 1A7F)
Metabolic database (KEGG)
– Look for data on glycolysis in KEGG.
Find the name of the enzyme which catalyzes the conversion of fructose 1,6 P2 into
glyceraldehyde 3P.
The name of the enzyme is ALDO (entry : K01623)
– Compare with the same pathway in sea urchin. Does the enzyme also exist in this
species ?
The enzyme also exists in sea urchin (entry : 548623)
Polymorphism database (dbSNP)
– Look for information in the dbSNP database on the human blue eye variant rs12913832
(A= ancestral brown allele, G = blue allele)
In which gene do we find this polymorphism ?
Find the corresponding publication/citation.
This polymorphism is found in the HERC2 gene.
The corresponding publication is ”Blue eye color in humans may be caused by a perfectly associated founder mutation in a regulatory element located within the HERC2
gene inhibiting OCA2 expression.” written by Eiberg H et al. (PMID : 18172690).
– What is Craig Venter’s ’eye color’ (look at the Celera genome assembly (= Craig
Venter) sequence)?
Follow the link to the Alfred database to look for the population distribution of the
’blue eye allele’ (Google map).
In which part of Europe is the blue allele the least prevalent ?
Can you propose a hypothesis for this geographical distribution ?
Craig Venter has the G allele, which suggests that he has blue eyes.
On the Google map, Spain is the part of Europe where the blue allele is least prevalent.
This geographical distribution could be explained by a mutation that arose before the
migrations and was retained in the northern regions, where there is less sun, because it
was not a disadvantage there.
2 Protein sequence analysis
Primary sequence analysis
– Find the physico-chemical parameters of the protein sequence (Seq 3 and Seq 4)
(use ProtParam, ’MW, pI, Titration curve’ and SAPS). Look in particular for the
number of amino acids, the MW, the pI, the molar extinction coefficient, the
total number of atoms and the chemical formula for each protein.

       Number of a.a.   MW (Da)    pI     Ext. coeff. at 280 nm (M-1 cm-1)   Total atoms   Formula
seq3   988              109770.1   9.16   52745                              15426         C4811 H7715 N1413 O1448 S39
seq4   1127             126363.8   6.68   122185                             17823         C5680 H8942 N1498 O1647 S56
Values found with ProtParam and compared with pI, Titration curve and SAPS.
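The total number of atoms can be cross-checked against the chemical formula by summing the element counts. The sketch below assumes that the nitrogen count of seq3 is 1413, which is the value that makes the formula add up to the reported 15426 atoms; the helper name `atom_counts` is illustrative.

```python
import re


def atom_counts(formula):
    """Parse a formula such as 'C4811 H7715 N1413 O1448 S39' into {element: count}."""
    return {element: int(count)
            for element, count in re.findall(r"([A-Z][a-z]?)(\d+)", formula)}


formulas = {
    "seq3": "C4811 H7715 N1413 O1448 S39",  # assumes N1413 (see lead-in)
    "seq4": "C5680 H8942 N1498 O1647 S56",
}

# Sum the per-element counts to obtain the total number of atoms per protein.
totals = {name: sum(atom_counts(f).values()) for name, f in formulas.items()}
print(totals)  # {'seq3': 15426, 'seq4': 17823}
```

Both sums agree with the totals reported by ProtParam.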
Topology - Transmembrane prediction
– Can you predict the subcellular location of the protein (use PSORT) ?
Can you predict the position of possible signal peptide (SignalP) ?
Can you predict the position of possible transmembrane segment(s) ? Compare HMMTOP, TMHMM, TMpred results
(Pay attention to the required sequence format !)
       PSORT                 SignalP   HMMTOP   TMHMM   TMpred
seq3   nuclear (94.1%)       NO        NO       NO      (Yes)
seq4   cytoplasmic (94.1%)   NO        Yes      Yes     Yes

For sequence 3, TMpred predicted a transmembrane segment, but this is implausible because the protein is predicted to be nuclear.
Fig. 2 – Possible transmembrane domains in sequence 4 proposed by the TMHMM tool
Post-translational modification (PTM) prediction
– Take your favorite protein sequence (Seq 3 and Seq 4).
First look at the biological information available for each type of PTM
Compare the results obtained with different phosphorylation prediction tools (NetPhos
and NetPhosK).
Compare the results obtained with different myristoylation prediction tools.
Compare the results obtained with different glycosylation prediction tools (YinOYang,
NetNGlyc).
What conclusion can you draw about the presence of these PTMs in your sequence ?
       NetPhos (position)           NetPhosK (position)   Myristoylator   NMT
seq3   Ser: 41   Thr: 7    Tyr: 7   PKA: 873              NO              NO
seq4   Ser: 31   Thr: 16   Tyr: 8   PKC: 367              NO              NO
NetPhosK predicts that PKC phosphorylates sequence 4 at position 367. TMpred places this position in a transmembrane
domain, which would make phosphorylation impossible. With TMHMM (figure above), position 367 is in the cytosol and is therefore a possible site for PKC.
Fig. 3 – O-GlcNAc sites in sequence 3 predicted by YinOYang.
Fig. 4 – O-GlcNAc sites in sequence 4 predicted by YinOYang.
NetNGlyc predicted N-glycosylation sites for sequence 3, but a nuclear protein cannot
have this modification. Three sites are particularly probable for N-glycosylation in sequence 4: position 317 (76%), position 360 (75%) and position 530 (61%).
The predicted post-translational modifications are consistent with the earlier results.
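As a quick sanity check on N-glycosylation predictions, one can list the Asn residues that lie in the consensus sequon N-X-S/T (X being any residue except Pro). This is only a motif scan, not the neural-network method used by NetNGlyc, and the sequence below is a hypothetical toy example, not Seq 3 or Seq 4.

```python
import re


def sequon_positions(protein):
    """Return 1-based positions of Asn residues in an N-X-[S/T] sequon (X is not Pro)."""
    # The lookahead reports the position of each matching Asn without consuming it,
    # so overlapping sequons are found as well.
    return [m.start() + 1 for m in re.finditer(r"(?=N[^P][ST])", protein.upper())]


toy_sequence = "MKNGSAANPTLLNVTQ"  # hypothetical sequence for illustration only
print(sequon_positions(toy_sequence))  # [3, 13]
```

The Asn at position 8 is skipped because it is followed by a proline. A sequon is necessary but not sufficient for glycosylation, which is why the subcellular location still has to be taken into account.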
BLAST
– Look for one of the protein sequences (Seq 3 and Seq 4).
Perform a BLAST search
BLAST @NCBI
BLAST @ExPASy
Note that the first hit may correspond to the same sequence stored in different sequence databases (UniProtKB and RefSeq).
To which protein family does your favorite protein sequence belong?
Look at the data available for the best hit with BLAST @ExPASy and compare the
annotation of the corresponding entry with the prediction results you get in the previous exercises.
Sequence 3 is a hypothetical protein with the AC AAK18922 (NCBI). The protein
is from C. elegans: 100% at NCBI and 73% at ExPASy, where the protein is uncharacterized. This protein could be involved in post-transcriptional gene expression processes
involving mRNA and rRNA (information taken from NCBI). The UniProt entry (O01864) for
this protein gives nucleic acid binding or nucleotide binding as molecular function.
These functions are consistent with the nuclear location of the protein.
Sequence 4 is also a hypothetical protein; its AC is NP_001023542 (NCBI). It is also from
C. elegans: 100% at NCBI and 96% at ExPASy, where the protein is uncharacterized.
The detected regions indicate that the protein could act as a cation transport ATPase
and an E1-E2 ATPase (NCBI). The UniProt entry (Q9N323) for this protein proposes
hydrolase as molecular function. The protein is also said to be in the membrane and to
have a transmembrane domain (UniProt), which agrees with the previous results.
From sequencing to biological information
– Read the following sequencing gel
What is the function of the corresponding gene product ?
Fig. 5 – Sequence : cagaagaggccatcaagcacatcactgtccttctgccatggccc
The NCBI BLAST search finds that it is the insulin mRNA from Homo sapiens. Insulin
functions as a hormone that is secreted when the blood glucose concentration is high;
it activates glucose uptake by the liver.
BLAST specificity
– Take a random DNA sequence
for example : attatacgtatataattccgataatcgcgctga
Using BLAST @NCBI try to find it in the human genome
It is impossible to find this sequence in the human genome. The best hit covers only
about half of the random sequence.
– Perform a BLAST search with a fragment of the insulin gene: ctgggcgggg gccctggtgc aggcagcctg
Repeat the exercise using a mutated insulin sequence: ctgggcgggg gccctggtgc aggcagcatg.
Insulin is still found by NCBI BLAST despite the mutation.
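To see why BLAST still finds insulin, it helps to locate the mutation: the two fragments differ at a single base. A minimal sketch comparing the two query strings position by position (the function name is illustrative):

```python
def mismatches(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for every difference."""
    a, b = seq_a.replace(" ", ""), seq_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [(i + 1, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


wild_type = "ctgggcgggg gccctggtgc aggcagcctg"
mutated = "ctgggcgggg gccctggtgc aggcagcatg"
print(mismatches(wild_type, mutated))  # [(28, 'c', 'a')]
```

With 29 of 30 bases identical, the mutated fragment remains far closer to the insulin gene than a random sequence of the same length would be.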
– Have a look at the mammoth genome project
Does the gene ’ken and barbie’ exist in mammoth ?
The "ken and barbie" gene was not found in the mammoth genome project database.
Summary exercise
– By using the following human protein sequence, do the most complete and primary
sequence analysis including the subcellular location and PTM prediction (try to be as
close as possible to the biology interest in the order of the analysis).
The NCBI BLAST search found the corresponding protein, which is fibronectin. This protein has a
molecular weight of 262606.5 Da, a pI of 5.45 and its formula is C11486 H17822 N3206 O3681 S90
(ProtParam). HMMTOP and TMHMM find no transmembrane domains, but
TMpred found some, which is not plausible for fibronectin because this protein is secreted and
present in the extracellular space and extracellular matrix (UniProt: P02751). Signal sequences were found with SignalP.
NetPhos predicted phosphorylation sites at serine 79, threonine 57 and tyrosine 25. NetPhosK found a PKC site at position 29 and there is no myristoylation (Myristoylator
and NMT). Fibronectin seems to have some N-glycosylation sites (NetNGlyc), such as at position 430 (76%), 542 (72%) and 1244 (71%), and many O-GlcNAc sites (YinOYang),
as shown in the picture below.
Fig. 6 – O-GlcNAc sites for fibronectin (YinOYang)
3 Phylogenetic analysis
Start playing with...
– .... Philophylo
Compare some of the trees obtained depending on the input sequences and/or the
number of input sequences.
Which protein (of those provided by this dataset) has been the most ’conserved’ during
the course of evolution ?
Do you have an idea why ?
Histone H4 is the most conserved because of its important function in compacting
DNA.
– Which protein is the most ’universal’ (= present in most of the species) ?
The most universal protein is the Cytochrome B.
Compare protein sequence by multiple alignment
– Here are the sequences of 5 orthologous genes (i.e. the same gene in 5 different species)
ARP2 A ARP2 B ARP2 C ARP2 D ARP2 E
Do a multiple alignment by using one of the alignment tool available on ExPASy. Compare the results obtained by the different tools.
T-COFFEE Output
CLUSTAL FORMAT for T-COFFEE Version_5.05 [http://www.tcoffee.org], SCORE=78, Nseq=5, Len=60

ARP2_B    MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
ARP2_E    MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ARP2_C    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_D    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ARP2_A    MES---APIVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
          *:*     :* ******** ***  *** . **::****::*:  : *::::**:***:*

CLUSTALW 2.0.10 multiple sequence alignment

ARP2_C    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE 60
ARP2_D    MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE 60
ARP2_B    MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE 60
ARP2_E    MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE 60
ARP2_A    MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE 57
          *:*     :* ******** ***  *** . **::****::*:  : *::::**:***:*
Muscle
>ARP2_A
MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE
>ARP2_E
MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
>ARP2_C
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
>ARP2_D
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
>ARP2_B
MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
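In conclusion, all three tools place a three-residue gap near the N-terminus of ARP2_A: T-COFFEE puts it at positions 4 to 6, whereas ClustalW and Muscle put it at positions 6 to 8.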
Manual phylogenetic analysis
– Look for the multiple sequence alignment obtained above
Fill-up the following ’distance-matrix’, by counting the differences between the sequences (if necessary, re-do an alignment with the sequences 2 by 2).
     A    B    C    D    E
A    –    24   24   24   25
B    –    –    10   10   15
C    –    –    –    0    13
D    –    –    –    –    13
E    –    –    –    –    –
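The counting behind such a distance matrix can be automated. The sketch below takes the five rows of the ClustalW alignment and counts the columns where two rows differ; treating a gap as a difference is an assumption made here. A hand count, or a pairwise realignment, may differ by a position from these column counts.

```python
from itertools import combinations

# Rows of the ClustalW alignment of the first 60 columns of ARP2.
aligned = {
    "A": "MESAP---IVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDE",
    "B": "MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE",
    "C": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "D": "MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE",
    "E": "MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE",
}


def differences(row1, row2):
    """Count alignment columns where the two rows differ (a gap counts as a difference)."""
    return sum(1 for x, y in zip(row1, row2) if x != y)


# Print the upper triangle of the distance matrix, one pair per line.
for name1, name2 in combinations(sorted(aligned), 2):
    print(name1, name2, differences(aligned[name1], aligned[name2]))
```

Sequences C and D are identical over this region (distance 0), which fits their assignment to the two mammals.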
– Knowing that species are :
Caenorhabditis briggsae
Drosophila melanogaster
Homo sapiens
Mus musculus
Schizosaccharomyces pombe
...which sequence is likely to correspond to which species ?
A=Schizosaccharomyces pombe
B=Caenorhabditis briggsae
C=Mus musculus
D=Homo sapiens
E=Drosophila melanogaster
- Try to draw a phylogenetic tree
Fig. 7 – Phylogenetic tree of an orthologous gene sequence from the species Schizosaccharomyces pombe, Caenorhabditis briggsae, Mus musculus, Homo sapiens and Drosophila
melanogaster
Phylogenetic analysis
– Get the ARP2 protein sequences from human, mouse, fruit fly, worm and fission yeast
from UniProtKB in Fasta format. The sequences are those of the previous exercise.
Homo sapiens (Human)
>sp|P61160|ARP2_HUMAN Actin-related protein 2 OS=Homo sapiens GN=ACTR2 PE=1 SV=1
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
LADIMKDKDNFWMTRQEYQEKGVRVLEKLGVTVR
Mus musculus (Mouse)
>sp|P61161|ARP2_MOUSE Actin-related protein 2 OS=Mus musculus GN=Actr2 PE=1 SV=1
MDSQGRKVVVCDNGTGFVKCGYAGSNFPEHIFPALVGRPIIRSTTKVGNIEIKDLMVGDE
ASELRSMLEVNYPMENGIVRNWDDMKHLWDYTFGPEKLNIDTRNCKILLTEPPMNPTKNR
EKIVEVMFETYQFSGVYVAIQAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFSLPHLTR
RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRMIKEKLCYVGYNIEQEQKLALETTVL
VESYTLPDGRIIKVGGERFEAPEALFQPHLINVEGVGVAELLFNTIQAADIDTRSEFYKH
IVLSGGSTMYPGLPSRLERELKQLYLERVLKGDVEKLSKFKIRIEDPPRRKHMVFLGGAV
LADIMKDKDNFWMTRQEYQEKGVRVLEKLGVTVR
Drosophila melanogaster (Fruit fly)
>sp|P45888|ARP2_DROME Actin-related protein 2 OS=Drosophila melanogaster GN=Arp14D PE=2 SV=2
MDSKGRNVIVCDNGTGFVKCGYAGSNFPTHIFPSMVGRPMIRAVNKIGDIEVKDLMVGDE
ASQLRSLLEVSYPMENGVVRNWDDMCHVWDYTFGPKKMDIDPTNTKILLTEPPMNPTKNR
EKMIEVMFEKYGFDSAYIAIQAVLTLYAQGLISGVVIDSGDGVTHICPVYEEFALPHLTR
RLDIAGRDITRYLIKLLLLRGYAFNHSADFETVRIMKEKLCYIGYDIEMEQRLALETTVL
VESYTLPDGRVIKVGGERFEAPEALFQPHLINVEGPGIAELAFNTIQAADIDIRPELYKH
IVLSGGSTMYPGLPSRLEREIKQLYLERVLKNDTEKLAKFKIRIEDPPRRKDMVFIGGAV
LAEVTKDRDGFWMSKQEYQEQGLKVLQKLQKISH
Caenorhabditis briggsae
>sp|Q61JZ2|ARP2_CAEBR Actin-related protein 2 OS=Caenorhabditis briggsae GN=arx-2 PE=3 SV=1
MDSQGRKVIVVDNGTGFVKCGYAGTNFPAHIFPSMVGRPIVRSTQRVGNIEIKDLMVGEE
CSQLRQMLDINYPMDNGIVRNWDDMGHVWDHTFGPEKLDIDPKECKLLLTEPPLNPNSNR
EKMFQVMFEQYGFNSIYVAAVLTLYAQGLLTGVVVDSGDGVTHICPVYEGFALHHLTRRL
DIAGRDITKYLIKLLLQRGYNFNHSADFETVRQMKEKLCYIAYDVEQEERLALETTVLSQ
QYTLPDGRVIRLGGERFEAPEILFQPHLINVEKAGLSELLFGCIQASDIDTRLDFYKHIV
LSGGTTMYPGLPSRLEKELKQLYLDRVLHGNTDAFQKFKIRIEAPPSRKHMVFLGGAVLA
NLMKDRDQDFWVSKKEYEEGGIARCMAKLGIKA
Schizosaccharomyces pombe (Fission yeast)
>sp|Q9UUJ1|ARP2_SCHPO Actin-related protein 2 OS=Schizosaccharomyces pombe GN=arp2 PE=1 SV=1
MESAPIVLDNGTGFVKVGYAKDNFPRFQFPSIVGRPILRAEEKTGNVQIKDVMVGDEAEA
VRSLLQVKYPMENGIIRDFEEMNQLWDYTFFEKLKIDPRGRKILLTEPPMNPVANREKMC
ETMFERYGFGGVYVAIQAVLSLYAQGLSSGVVVDSGDGVTHIVPVYESVVLNHLVGRLDV
AGRDATRYLISLLLRKGYAFNRTADFETVREMKEKLCYVSYDLELDHKLSEETTVLMRNY
TLPDGRVIKVGSERYECPECLFQPHLVGSEQPGLSEFIFDTIQAADVDIRKYLYRAIVLS
GGSSMYAGLPSRLEKEIKQLWFERVLHGDPARLPNFKVKIEDAPRRRHAVFIGGAVLADI
MAQNDHMWVSKAEWEEYGVRALDKLGPRTT
– Reconstruct phylogenetic trees using the ’One click’ analysis methods provided at
http://www.phylogeny.fr/.
In case of server problems, use alternative servers for phylogenetic analysis.
Fig. 8 – Phylogenetic tree of the ARP2 protein sequences from human, mouse, fruit fly,
worm and fission yeast done with the http://www.phylogeny.fr/ One click method.
Phylogenetic analysis
– How many distinct trees do you have on this figure ?
Fig. 9 – All the trees are the same. There is only one tree.
– List the positions on the following trees, where there is
- a gene duplication event : 2 and 10
- a speciation event : 1, 3, 4, 5, 6, 7, 8, 9, 11 and 12
Fig. 10 – Nodes 2 and 10 are duplications and the other ones are speciations
The Tree of Life
– Construct a phylogenetic tree based on dataset 4 using the ’one click’ method at
http://www.phylogeny.fr/.
To get the correspondence between the 5 letter codes (i.e. ARATH, BACSU) and the
species, query the UniProt website or look at the document Controlled vocabulary of
species@UniProt
– Explain the tree. Locate gene duplication and/or speciation events.
Does the resulting tree correspond to the species tree (3 kingdoms (Eukaryota, Archaea,
Bacteria))?
Fig. 11 – The big separation, which is indicated as a duplication, is the only duplication.
The other events are speciations.
All three kingdoms are present on the tree.
– Try to explain the position of EFTU_ARATH in the tree.
EFTU_ARATH is the chloroplast form of the protein in Arabidopsis thaliana. Chloroplasts
are of bacterial origin, which explains its position in the tree.
Exercise 6
– If you are still alive, construct a tree with your favorite protein (e.g. insulin)...
Fig. 12 – The phylogenetic tree for the ATP synthase subunit a. The sequences were
found on UniProt.
Branchiostoma floridae is a lancelet and Salmo salar is a fish. Sus scrofa is the wild boar.
Anopheles gambiae and Aedes aegypti are insects and Metridium senile is a
sea anemone.
All the separations are speciations.
4 Introduction to gene prediction
Non-protein-coding RNA (ncRNA) gene prediction
– In a C.elegans genomic sequence (cosmid) :
..look for the presence of tRNA gene(s) with tRNAscan-SE. Use the default ’search
mode’ and the source ’Eukaryotic’. Have a look at the tRNA structure.
Sequence            tRNA     Bounds   tRNA   Anti    Intron   Bounds   Cove
Name       tRNA #   Begin    End      Type   Codon   Begin    End      Score
-------    -----    ---      -----    ---    ----    ----     ---      -----
Cosmid     1        169      238      His    ATG     0        0        20.56

Fig. 13 – Image of the tRNA of the given sequence
There is one tRNA.
’Ab initio’ protein-coding gene prediction
– Get gene 1, a genomic sequence from C.elegans, and compare the results of gene predictions obtained by different programs (pay attention to the format of the submitted
sequence) :
HMMgene
Netgene2
WebGene (Genebuilder) (option : ”First and last coding exons : disabled”)
Draw a schema describing the different predicted gene structures with the positions
(numbering) of the exon and intron boundaries.
Fig. 14 – Exon and intron boundaries with different tools on the C.elegans gene. The
mRNA (EST) of the same gene (Blastn from NCBI).
– Compare the results obtained by HMM if you choose ’human’ instead of C.elegans as
organism.
Why are they different ?
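With the human model, HMMgene finds the same exon boundaries as with the C. elegans
model, but it predicts three additional exons on the complementary strand: 1290 to 1418,
1461 to 1650 and 1443 to 2522. The predictions presumably differ because each model is
trained on the sequence statistics of its own organism.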
Protein-coding gene prediction and the use of sequenced mRNAs (ESTs)
– Do a Blastn search at NCBI with the genomic sequence (gene 1)
Select C. elegans ESTs (mRNAs).
How many different ”RNA” sequence(s) can you retrieve ?
There are 3 different RNAs, mainly with 4 exons.
– Retrieve the sequence of the mRNA (EST) BJ818152 in Fasta format.
Align this EST with the genomic sequence by using SIM4 (an alignment tool specific
for cDNA and genomic sequence alignment; SIM4 takes care of the intron/exon boundaries).
Compare the intron/exon boundaries numbering with the results obtained by the prediction programs (previous exercise).
>BJ818152 unpublished oligo-capped cDNA library, stage L4 Caenorhabditis
elegans cDNA clone yk1685h11 3’, mRNA sequence.
TAACGGGACCGAGAACGTTTATCGCTTTCCTCCGACACGTGGAGCAGCAGTCTTCACATT
CTTGGCGGTCTTTTGCTGGGTCTTTGGCTGAGAGGCCTTCTTTTCCTTGTTGGCAGCAGC
CTTGGCGGCACGGACAGCCTTGTTGGCATCCTTGGCGATCTTAGCGGCTTGTTCACGCTG
TTGGCGACGGAAGTCTTCGGTCTGGTTTCTCTTGGCAAGGATAGCATCAAGGGAAAGTCC
AGCGACGGCGCGGTTAACAACCTGGACGGACTTCTTGGTCTTCTTTCTGGTGACTTGCTC
TTGTCCGTGGGTTCCCTTCTTGTTCTTGATTCTGTAGAGGACAGTCCATCTGATGTCACG
TGGGTTACGGCGAAGCTTGGCTCCCTTGAGTGCCTTTCCACTGAGGAAGATTTGGACCTT
TCCGTCAGTACGGACAAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAAC
GCAGGTTTCGACCTTCATTGTTGATAGGCCCTCGCTTGACGAATCTCAAACTTGGGTAAT
TAAACCTACAAATAAAAATGAGATAAAGCATACTGCCATTCTACAACCGGAGAATAAGAA
AACCGAAAACGAGAAAATTATTCTATTATGACAGATAGAATAAGTTAAAATGGGAAGAGT
GCATTTGTCACTGATTTACTTGGTGACTTGGTGGAGAGCGTGGGCAAGGTAAGCGACATT
GTTCGATGAA
The EST schema is shown in the previous figure.
– Do the same job with another EST (with different exon/intron boundaries if they exist).
The BJ775052 EST was chosen and the boundaries are about the same : from 969 to
1406, from 1452 to 1661 and 1914 to 2019.
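Exon boundaries like these translate directly into exon lengths and intron intervals, which makes it easier to compare the EST alignment with the predictions. A minimal sketch using the three BJ775052 exons listed above, with 1-based inclusive coordinates assumed:

```python
# Exon boundaries reported for EST BJ775052 (1-based, inclusive; assumed convention).
exons = [(969, 1406), (1452, 1661), (1914, 2019)]

# Length of each exon.
exon_lengths = [end - start + 1 for start, end in exons]

# Each intron spans the bases between two consecutive exons.
introns = [(prev_end + 1, next_start - 1)
           for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]
intron_lengths = [end - start + 1 for start, end in introns]

print(exon_lengths)    # [438, 210, 106]
print(introns)         # [(1407, 1451), (1662, 1913)]
print(intron_lengths)  # [45, 252]
```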
Translation
– Translate the EST BJ818152 sequence by using one of the tools provided at ExPASy
(pay attention to the EST sequence orientation !)
Select several potential ORFs (open reading frame).
Using BlastP (@ExPASy) compare each potential ORF with already known C.elegans
protein sequences
Identify the correct protein sequence.
Try to find the function of the protein.
This is the selected ORF from the translation of the EST BJ818152 (3’5’ frame, because
the EST is a 3’ read):
M K V E T C V Y S G Y K I H P G H G K R L V R T D G K V Q I F L S G K
A L K G A K L R R N P R D I R W T V L Y R I K N K K G T H G Q E Q V T
R K K T K K S V Q V V N R A V A G L S L D A I L A K R N Q T E D F R R
Q Q R E Q A A K I A K D A N K A V R A A K A A A N K E K K A S Q P K
T Q Q K T A K N V K T A A P R V G G K R
Fig. 15 – The BlastP from the above ORF.
The protein found is the 60S ribosomal protein L24 from Caenorhabditis elegans. Its
molecular function is structural constituent of ribosome (UniProt entry: O01868).
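Because the EST is a 3’ read, the coding strand is its reverse complement. The sketch below reverse-complements a short stretch taken from the end of the ninth line and the start of the tenth line of the BJ818152 record, then translates from the first ATG with the standard genetic code. It is a simplified stand-in for the ExPASy translation tool, not its implementation, and the function names are illustrative.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def first_orf(dna):
    """Translate from the first ATG to the first stop codon (or the end of the sequence)."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X for unknown codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Sub-sequence of EST BJ818152 covering the start codon region on the opposite strand.
est_fragment = ("AAGTCTCTTTCCGTGTCCTGGGTGGATCTTGTATCCGGAGTAAAC"
                "GCAGGTTTCGACCTTCATTGTTGATAGG")
orf_start = first_orf(reverse_complement(est_fragment))
print(orf_start)  # MKVETCVYSGYKIHPGHGKRL
```

The translated fragment matches the beginning of the selected ORF. Translating the EST in its given orientation would not give this reading frame.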
– For fun : Translate directly the genomic sequence (gene 1) and try to find the correct
protein sequence.
Translating the genomic sequence directly does not yield the protein, because introns interrupt the coding sequence.
If you are still alive...
– ...try to find the correct protein sequence encoded by the following genomic sequence
from C. elegans (gene 2)
Here we search for the protein corresponding to gene 2.
First, the exons of gene 2 are located on the complementary strand at the following
positions: 789 to 1111, 1410 to 1636 and 1688 to 1845 (HMMgene). A blastn search is then
done with gene 2 (NCBI) and a sequence with similar exons is chosen.
>OSTR075F6_1 AD-wrmcDNA Caenorhabditis elegans cDNA, mRNA sequence
AATTTGCCCGGGTTCCTTCTTCAACGGATCCTCTTCCTCGTCCTTAACTCTTCTGATCTT
CTCCTGTTTTCGATACTTCGCCCGCCGATTCTGAAACCACACTTGAACTCGGGCTTCAGT
TAAATCAATTCTCATTGCAATTTCTTCTCGTGTATAAATATCTGGATAATGAGTTTCACA
GAATGATCTTTCCAACTCCTTCAGTTGTCCTGATGTGAATGTGGTACGGATTCGGCGTTG
TTTTCGACGCTCGGCAGGGTTCAAAGGAGCTCCACCGGTTGAGCAGAGAGCACCAACAAG
AGAACTTCTTGGCAGTCCGTTCAAAACATTGCTACTTGTCCTCTGAATCGTATCACTTCC
AATTAATTGTGATTTTTGATACAACTGATATTGTAGACCAGTATTAAAAAAAGCTTGTAG
TGAATCGTGTGTTGTATTACGGTAGTTTGATGAGGAAGATGATGAAGATGTGGAATTGCC
CGCTGAAGAGCTTGAAGTATTGTGAGCAGTTGTCAAGGCACGTCCACTTTGT
After that, this sequence is translated (NCBI Translate tool) and an ORF is chosen. The
chosen ORF is in the 3’5’ direction because we are searching on the complementary strand:
M R I D L T E A R V Q V W F Q N R R A K Y R K Q E K I R R V K D E E E
D P L K K E P G Q I
Finally, a BlastP search is done. The protein is the homeobox protein unc-4. It is a transcription
factor (UniProt entry P29506), which could explain why so few mRNAs were found in the blastn search.
"I certify that in this text every statement that is not the result of my own reflection is attributed to its source, and that every passage copied from another source is
also placed in quotation marks."
Daniel Abegg