EXERCISE 3 MULTIPLE ALIGNMENT AND BLAST

Suppose you have cloned and sequenced the following sequence in the lab. During the exercise
session we will try to study the function of this gene based on in silico searches. The only thing
we know is that we isolated this sequence from the bacterial species Paracoccus denitrificans
and that it is involved in respiration.
MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGARLIADASAEC
TPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGVALGVASLLAPGGNDQMGS
GVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLFKVPLFAWSVFITAWLILLSLPVLAG
AITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIISHVISTFAKKPIFGYLPMVLAMAAIGILGF
VVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMWGGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDR
VYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGKMSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPV
EFAYWNNISSIGAYISFASFLFFIGIVFYTLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH
1 Convert the sequence to FastA format
The FastA format is the standard format used by most sequence-based programs (ClustalW, …).
A sequence in FASTA format begins with a single-line description, followed by lines of sequence
data. The description line is distinguished from the sequence data by a greater-than (">") symbol in
the first column. It is recommended that all lines of text be shorter than 80 characters in length. An
example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRTQIWQKHRTSNDS
ALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLPKER
YRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPG
PCVQRTYVACHIRSVIIWLETISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRY
KLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNLL
AAVEAQQQMLKLTIWGVK
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid
codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a
single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid
sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical
digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g.,
N for unknown nucleic acid residue or X for unknown amino acid residue).
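The clean-up described above can be sketched in a few lines of Python. This is only an illustration, not part of the exercise: the function name `to_fasta` and the line width of 70 are assumptions, and the sketch chooses to remove digits rather than replace them with N or X.

```python
import re

def to_fasta(description, raw_sequence, width=70):
    """Return one FASTA record: a single-line header plus wrapped sequence lines."""
    # Remove digits and whitespace (e.g. position numbers copied from a report)
    # and map lower-case letters to upper-case.
    cleaned = re.sub(r"[\d\s]", "", raw_sequence).upper()
    # Cut the sequence into lines of at most `width` characters (< 80).
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return ">" + description + "\n" + "\n".join(lines) + "\n"

# Hypothetical input: the first 20 residues with position numbers mixed in.
print(to_fasta("unknown_gene", "1 madaavhghg 11 dhhdtrgfft"))
```

The header is kept on one line and every sequence line stays below the recommended 80 characters.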
The nucleic acid codes supported are:
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -  gap of indeterminate length
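As an illustration, the nucleic acid codes listed above can be written as a lookup table. The names `IUPAC_DNA`, `code_matches` and `count_ambiguous` are hypothetical helpers for this sketch, not part of any BLAST tool.

```python
# Each IUPAC code mapped to the concrete bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}

def code_matches(code, base):
    """True if the concrete base is one of those the IUPAC code stands for."""
    return base.upper() in IUPAC_DNA[code.upper()]

def count_ambiguous(sequence):
    """Count positions holding an ambiguity code (gaps are not counted)."""
    return sum(1 for c in sequence.upper()
               if c != "-" and len(IUPAC_DNA[c]) > 1)

print(code_matches("R", "a"))        # R stands for G or A
print(count_ambiguous("ACGTNNRY-"))  # N, N, R and Y are ambiguous
```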
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted
amino acid codes are:
A  alanine                      P  proline
B  aspartate or asparagine      Q  glutamine
C  cysteine                     R  arginine
D  aspartate                    S  serine
E  glutamate                    T  threonine
F  phenylalanine                U  selenocysteine
G  glycine                      V  valine
H  histidine                    W  tryptophan
I  isoleucine                   Y  tyrosine
K  lysine                       Z  glutamate or glutamine
L  leucine                      X  any
M  methionine                   *  translation stop
N  asparagine                   -  gap of indeterminate length
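Before submitting a protein query it can help to check it against the accepted amino acid codes. The following is a minimal sketch; `ACCEPTED_AA` and `invalid_residues` are names invented for this example.

```python
# Accepted amino acid codes from the list above (note: no J and no O).
ACCEPTED_AA = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")

def invalid_residues(sequence):
    """Return (position, letter) pairs for characters outside the accepted codes.

    Positions are 1-based; lower-case input is mapped to upper-case first.
    """
    return [(pos, letter)
            for pos, letter in enumerate(sequence.upper(), start=1)
            if letter not in ACCEPTED_AA]

print(invalid_residues("MADAAVHGHG"))  # a clean fragment gives an empty list
print(invalid_residues("maJo1"))       # J, O and the digit are reported
```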
This is the sequence of the unknown gene
>unknown_gene
MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGAR
LIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVCGV
ALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAPGMTLF
KVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVYIIILPGFGIIS
HVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVPTGIKVFSWIATMW
GGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGAVFGIFAGVYYWIGK
MSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISSIGAYISFASFLFFIGIVFY
TLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH
Homology search using Blast
http://www.ncbi.nlm.nih.gov/guide/homology/
Find homologs of your sequence of interest in the NCBI sequence database.
BLAST your sequence.
Can you derive a clue about the function of your sequenced protein from the BLAST hits?
Go to the GenBank file of the best hit (click on the link).
What information can you find in the GenBank file?
Do you find matches with eukaryotic sequences? Are these significant? What does that mean?
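To get a feeling for what "significant" means, the E-value reported by BLAST can be approximated from the bit score S' with the standard relation E = m * n * 2^(-S'), where m is the query length and n the total length of the database. The sketch below is an approximation only: BLAST itself uses slightly shorter effective lengths, and the database size used here is a made-up figure.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value: hits with at least this bit score expected by chance."""
    return query_length * database_length * 2.0 ** (-bit_score)

QUERY_LENGTH = 558       # length of the unknown protein in residues
DATABASE_LENGTH = 1e9    # hypothetical total database size in residues

for bits in (30, 50, 100, 500):
    print(bits, expected_hits(bits, QUERY_LENGTH, DATABASE_LENGTH))
```

An E-value far below 1 means that a hit with this score is unlikely to arise by chance in a database of this size; an E-value near or above 1 means that such hits are expected by chance alone.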
This system is interesting because it is very ancient and has been conserved throughout all
phylogenetic branches.
To find out which sequence residues are involved in the catalytic function, we will construct
an alignment of sequences from distinct species, so that we can have a representative
alignment of the family of terminal oxidases. From this alignment we will derive the residues
that are essential for the function because they have been conserved.
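The idea of reading conserved residues off an alignment can be sketched as follows. The function `conserved_columns` and the three-sequence toy alignment are assumptions made for illustration; a real analysis would use the full alignment produced later in the exercise.

```python
def conserved_columns(aligned_sequences):
    """Return (column, residue) for columns where all sequences agree.

    Columns are 1-based positions in the alignment; columns that contain
    a gap are never reported as conserved.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for col in range(length):
        residues = {seq[col] for seq in aligned_sequences}
        if len(residues) == 1 and "-" not in residues:
            conserved.append((col + 1, aligned_sequences[0][col]))
    return conserved

# Toy alignment (hypothetical), seven columns wide.
toy = ["WAHHMYT", "WAHHMFT", "WAHH-YT"]
print(conserved_columns(toy))
```

Full conservation is a strict criterion; with many divergent sequences one would usually allow a small fraction of mismatches per column.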
(Perform a query against the reference database.)
In which species do you find the most hits? Is this what you expect, and why?
From the results, select the following sequences in FastA format and add them to the clipboard.
Select 3 sequences from bacterial species, e.g.:
1. cytochrome-c oxidase [Paracoccus denitrificans PD1222]
558 aa protein
YP_915727.1 GI:119384671
2. cytochrome c oxidase, aa3-type, subunit I [Rhodobacterales bacterium HTCC2654]
557 aa protein
ZP_01014069.1 GI:84686174
3. putative cytochrome C oxidase polypeptide I transmembrane protein [Sinorhizobium meliloti 1021]
562 aa protein
NP_385011.1 GI:15964658
Because the protein is so widely distributed and we found some hits in eukaryotes as well, we
will search for more eukaryotic hits. However, because the “unknown protein” is of
bacterial origin, a standard search returns the prokaryotic sequences first, since they are the most
similar. To focus on the eukaryotic hits, we will perform an advanced BLAST search and blast
the “unknown sequence” against the non-redundant database from which the prokaryotic
sequences were excluded.
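On the BLAST web page the restriction is set in the organism field. For a scripted search, the same restriction can be expressed as an Entrez query. The sketch below assumes Biopython is installed and that network access is available for the actual search; `organism_filter` and `blast_restricted` are hypothetical helper names, while `NCBIWWW.qblast` and its `entrez_query` argument belong to Biopython.

```python
def organism_filter(include=(), exclude=()):
    """Build an Entrez organism restriction for a BLAST search."""
    wanted = " OR ".join(f"{name}[Organism]" for name in include)
    if not wanted:
        query = "all[filter]"      # no positive restriction: start from everything
    elif len(include) > 1:
        query = f"({wanted})"
    else:
        query = wanted
    if exclude:
        unwanted = " OR ".join(f"{name}[Organism]" for name in exclude)
        query = f"{query} NOT ({unwanted})"
    return query

def blast_restricted(sequence, entrez_query):
    """Submit a BLASTP search against nr; needs Biopython and network access."""
    from Bio.Blast import NCBIWWW  # imported here so the helper above runs without it
    return NCBIWWW.qblast("blastp", "nr", sequence, entrez_query=entrez_query)

print(organism_filter(exclude=["Bacteria", "Archaea"]))
print(organism_filter(include=["Mammalia"]))
print(organism_filter(include=["Viridiplantae"]))
```

The three printed queries correspond to the searches in this part of the exercise: everything except prokaryotes, mammals only, and plants only.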
Redo the blast but now only search for mammalian hits
Add a few of the hits to the clipboard, and make sure to include the following sequences from
the BLAST output:
1: ABR93038
cytochrome c oxidase subunit I [Homo sapiens]
gi|151327759|gb|ABR93038.1|[151327759]
2: YP_001686700
cytochrome c oxidase subunit I [Mus musculus musculus]
gi|167716839|ref|YP_001686700.1|[167716839]
Redo the BLAST search, but now only search for plant hits (Viridiplantae).
3: NP_085587
cytochrome c oxidase subunit 1 [Arabidopsis thaliana]
gi|13449404|ref|NP_085587.1|[13449404]
4: YP_514675
cytochrome c oxidase subunit 1 [Oryza sativa Indica Group]
gi|89280750|ref|YP_514675.1|[89280750]
Add the following sequences by searching for these accession numbers in GenBank, and add
the FastA files to the clipboard:
YP_428413
NP_769403
YP_001188087
Download the corresponding protein sequences in FastA format
>gi|151327759|gb|ABR93038.1| cytochrome c oxidase subunit I [Homo sapiens]
MFADRWLFSTNHKDIGTLYLLFGAWAGVLGTALSLLIRAELGQPGNLLGNDHIYNVIVTAHAFVMIFFMV
MPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASAMVEAGAGTGWTVYPPLAGNYSHPG
ASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLILSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGSNMKWSAAVLWALGFIFLFTV
GGLTGIVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFIG
VNLTFFPQHFLGLSGMPRRYSDYPDAYTTWNILSSVGSFISLTAVMLMIFMIWEAFASKRKVLMVEEPSM
NLEWLYGCPPPYHTFEEPVYMKS
>gi|167716839|ref|YP_001686700.1| cytochrome c oxidase subunit I [Mus musculus musculus]
MFINRWLFSTNHKDIGTLYLLFGAWAGMVGTALSILIRAELGQPGTLLGDDQIYNVIVTAHAFVMIFFMV
MPMMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSFLLLLASSMVEAGAGTGWTVYPPLAGNLAHAG
ASVDLTIFSLHLAGVSSILGAINFITTIINMKPPAMTQYQTPLFVWSVLITAVLLLLSLPVLAAGITMLL
TDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHVVTYYSGKKEPFGYMGMVWAMMSI
GFLGFIVWAHHMFTVGLDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGGNIKWSPAMLWALGFIFLFTV
GGLTGIVLSNSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMAGFVHWFPLFSGFTLDDTWAKAHFAIMFVG
VNMTFFPQHFLGLSGMPRRYSDYPDAYTTWNTVSSMGSFISLTAVLIMIFMIWEAFASKREVMSVSYAST
NLEWLHGCPPPYHTFEEPTYVKVK
>gi|13449404|ref|NP_085587.1| cytochrome c oxidase subunit 1 [Arabidopsis thaliana]
MKNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIF
FMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITS
HSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAIT
MLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSGKPVFGYLGMVYAMI
SIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLF
TIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITF
FGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGICCFFVVVTITLSSGNNKRCAPSP
WALELNSTTLEWMVQSPPAFHTFGELPAIKETKSYVK
>gi|89280750|ref|YP_514675.1| cytochrome c oxidase subunit 1 [Oryza sativa Indica Group]
MTNLVRWLFSTNHKDIGTLYFIFGAIAGVMGTCFSVLIRMELARPGDQILGGNHQLYNVLITAHAFLMIF
FMVMPAMIGGFGNWFVPILIGAPDMAFPRLNNISFWLLPPSLLLLLSSALVEVGSGTGWTVYPPLSGITS
HSGGAVDLAIFSLHLSGVSSILGSINFITTIFNMRGPGMTMHRLPLFVWSVLVTAFLLLLSLPVLAGAIT
MLLTDRNFNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGIISHIVSTFSRKPVFGYLGMVYAMI
SIGVLGFLVWAHHMFTVGLDVDTRAYFTAATMIIAVPTGIKIFSWIATMWGGSIQYKTPMLFAVGFIFLF
TIGGLTGIVLANSGLDIALHDTYYVVAHFHYVLSMGAVFALFAGFYYWVGKIFGRTYPETLGQIHFWITF
FGVNLTFFPMHFLGLSGMPRRIPDYPDAYAGWNALSSFGSYISVVGIRRFFVVVAITSSSGKNKRCAESP
WAVEQNPTTLEWLVQSPPAFHTFGELPAIKETKS
>gi|69934593|ref|ZP_00629671.1| Cytochrome-c oxidase [Paracoccus denitrificans PD1222]
MADAAVHGHGDHHDTRGFFTRWFMSTNHKDIGILYLFTAGIVGLISVCFTVYMRMELQHPGVQYMCLEGA
RLIADASAECTPNGHLWNVMITYHGVLMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRLNNLSYWMYVC
GVALGVASLLAPGGNDQMGSGVGWVLYPPLSTTEAGYSMDLAIFAVHVSGASSILGAINIITTFLNMRAP
GMTLFKVPLFAWSVFITAWLILLSLPVLAGAITMLLMDRNFGTQFFDPAGGGDPVLYQHILWFFGHPEVY
IIILPGFGIISHVISTFAKKPIFGYLPMVLAMAAIGILGFVVWAHHMYTAGMSLTQQAYFMLATMTIAVP
TGIKVFSWIATMWGGSIEFKTPMLWAFGFLFLFTVGGVTGVVLSQAPLDRVYHDTYYVVAHFHYVMSLGA
VFGIFAGVYYWIGKMSGRQYPEWAGQLHFWMMFIGSNLIFFPQHFLGRQGMPRRYIDYPVEFAYWNNISS
IGAYISFASFLFFIGIVFYTLFAGKRVNVPNYWNEHADTLEWTLPSPPPEHTFETLPKREDWDRAHAH
>gi|146276716|ref|YP_001166875.1| Cytochrome-c oxidase [Rhodobacter sphaeroides ATCC 17025]
MADAAIHGHEHDRRGFFTRWFMSTNHKDIGVLYLFTGGLVGLISVAFTVYMRMELMAPGVQFMCAEHLES
GLVKGFFQSLWPSAVENCTPNGHLWNVMITAHGILMMFFVVIPALFGGFGNYFMPLHIGAPDMAFPRMNN
LSFWLYVAGTSLAVASLFAPGGNGQLGSGIGWVLYPPLSTSESGYSTDLAIFAVHLSGASSILGAINMIT
TFLNMRAPGMTMHKVPLFAWSIFVTAWLILLALPVLAGAITMLLTDRNFGTTFFQPSGGGDPVLYQHILW
FFGHPEVYIIVLPAFGIVSHVIATFSKKPIFGYLPMVYAMVAIGVLGFVVWAHHMYTAGLSLTQQSYFMM
ATMVIAVPTGIKIFSWIATMWGGSIELKTPMLWALGFLFLFTVGGVTGIVLSQASVDRYYHDTYYVVAHF
HYVMSLGAVFGIFAGIYFWIGKMSGRQYPEWAGKLHFWMMFVGANLTFFPQHFLGRQGMPRRYIDYPEAF
ATWNFVSSLGAFLSFASFLFFIGIVFYTLTRGARVTANNYWNEHADTLEWTLTSPPPEHTFERLPKREDW
DRSHAH
>gi|27376282|ref|NP_767811.1| cytochrome C oxidase subunit I [Bradyrhizobium japonicum USDA 110]
MATSAAAHGDHAQDHGHDEHAHPTGWRRYVYSTNHKDIGTMYLIFAVIAGVIGAAMSIAIRAELMYPGVQ
IFHETHTYNVFVTSHGLIMIFFMVMPAMIGGFGNWFVPLMIGAPDMAFPRMNNISFWLLPASFGLLLMST
FVEGEPGANGVGAGWTMYVPLSSSGHPGPAVDFAILSLHLAGASSILGAINFITTIFNMRAPGMTLHKMP
LFVWSILVTVFLLLLSLPVLAGAITMLLTDRNFGTTFFAPDGGGDPVLFQHLFWFFGHPEVYILILPGFG
MISQIVSTFSRKPVFGYLGMAYAMVAIGGIGFVVWAHHMYTVGMSSATQAYFVAATMVIAVPTGVKIFSW
IATMWGGSIEFRAPMIWAVGFIFLFTVGGVTGVVLANAGVDRVLQETYYVVAHFHYVLSLGAVFAIFAGW
YYWFPKMTGYMYNETLAKAHFWVTFIGVNLVFFPQHFLGLSGMPRRYVDYPDAFAGWNLVSSVGSYISGF
GVLIFLYCVIDAFAKKVPAGDNPWGAGATTLEWTLPSPPPFHQFEVLPRVQ
>gi|83594661|ref|YP_428413.1| Cytochrome c oxidase cbb3-type, subunit I [Rhodospirillum rubrum ATCC 11170]
MTQATIARGARADTEPYVEGVIKKFVIAAVLWGVVGFIAGDVIAWQLAFPSLNMDLEWTSFGRLRPLHTS
AVVFAFGGNVLLGTSLYVVQRTSRASLYGGAALGNIIFWGYQLFIVMAALGYVLGVTQGKEYAEPEWFVD
LFLTVVWVLYLAAFLGTLLKRREPHIYVANWFFLAMIITIALLHLGNNMAIPVALMGGDSWVKSYGFYSG
VQDAMTQWWYGHNAVGFFLTAGFLGIMYYFVPKQAQRPVYSYRLSIVHFWALIFLYIWAGPHHLHYTALP
DWAQTVGMVFSVMLWMPSWGGMINGLMTLSGAWDKLRTDPVLRFLVVSVGFYGMSTFEGPMMSIKAVNSL
SHYTDWTIGHVHSGALGWVAFVSFGALYYLVPALWKRRSLYSLKLVSLHFWIATLGIVLYITSMWVSGIM
QGLMWRAYDELGFLQYSFIESVAAMHPYYIIRATGGVLFVIGSVVMVYNMYRTIKGDIREDAPQAAYLGS
AAGVRR
>gi|27377874|ref|NP_769403.1| cytochrome-c oxidase [Bradyrhizobium japonicum USDA 110]
MSQPSISKSMTIGESGLAVVFAATAFLCVIAAAKALDAPFAFHAALSAAASVAAVFCIVNRYFERPAALP
PAEINGRPNYNMGPIKFSSFMAMFWGIAGFLVGLIIASQLAWPALNFDLPWISFGRLRPLHTSAVIFAFG
GNVLIATSFYVVQKSCRVRLAGDLAPWFVVVGYNFFILVAGTGYLLGVTQSKEYAEPEWYADLWLTIVWV
VYLLVFLATIIKRKEPHIFVANWFYLAFIVTIAVLHLGNNPALPVSAFGSKSYVAWGGIQDAMFQWWYGH
NAVGFFLTAGFLAIMYYFIPKRAERPIYSYRLSIIHFWALIFLYIWAGPHHLHYTALPDWTQTLGMTFSI
MLWMPSWGGMINGLMTLSGAWDKLRTDPVLRMLVVSVAFYGMSTFEGPMMSIKVVNSLSHYTDWTIGHVH
SGALGWVGFVSFGALYCLVPWAWNRKGLYSLKLVNWHFWVATLGIVLYISAMWVSGILQGLMWRAYTSLG
FLEYSFIETVEAMHPFYIIRAAGGGLFLIGALIMAYNLWMTVRVGEAEVQMPVALQPAE
>gi|146307622|ref|YP_001188087.1| cytochrome c oxidase, cbb3-type, subunit I [Pseudomonas mendocina ymp]
MNTTTRSAYNYRVVRQFAIMTVVWGIVGMGLGVFIAAQLAWPDLNFNLPWTSFGRLRPLHTNAVIFAFGG
CALFATSYYAVQRTSQTTLFAPKLAAFTFWGWQLVIVLAAISLPLGWTSSKEYAELEWPIDILITIVWVS
YAIVFFGTVMQRKVSHIYVGNWFFGGFILTVAILHVVNNLEIPITLTKSYSLYAGATDAMIQWWYGHNAV
GFFLTAGFLGMMYYFVPKQAGRPVYSYRLSIVHFWALIAVYIWAGPHHLHYTALPDWAQSLGMVMSLVLL
APSWGGMINGMMTLSGAWHKLRTDPILRFLVVSLAFYGMSTFEGPMMAIKTVNALSHYTDWTIGHVHAGA
LGWVAMVSIGSLYHLIPKVFGREQMHSIGLINSHFWLATIGTVLYIASMWVNGITQGLMWRAVNEDGTLT
YSFVEALEASHPGFVVRVIGGAIFFAGMLLMAWNVWLTVRSAKSTEMEAAAQFSVEGAH
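Before feeding a collection of sequences like this to an alignment program, it is worth checking that the file parses as expected. The following is a minimal sketch of a FASTA reader; the name `parse_fasta` and the shortened sample records are assumptions for illustration, and the reader assumes that every description occupies a single line.

```python
def parse_fasta(text):
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence data of the current record
    if header is not None:
        yield header, "".join(chunks)

# Two shortened, hypothetical records in the same layout as the file above.
sample = """>seq1 first protein
MADAA
VHGHG
>seq2 second protein
MFADR
"""
for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```

Comparing the reported lengths with the lengths listed in GenBank (e.g. 558 aa for the Paracoccus sequence) is a quick way to spot truncated or merged records.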
Multiple sequence alignment
We will use the sequences selected above as input in a multiple sequence alignment program
(ClustalW).
- Download the program from
http://inn-prot.weizmann.ac.il/software/ClustalX.html
Information on the program can be found at:
http://www.molbiol.ox.ac.uk/documentation/clustalx/clustalx.html
- Or use the web interface
http://www.ebi.ac.uk/clustalw/
Globally align the sequences that you have selected.
Test the influence of different parameters. Once you have obtained a reliable alignment:
1) Look at the alignment and the phylogenetic tree. What do you observe?
2) Compare the multiple alignment with the local pairwise alignments of two members of the
family (e.g. dataset2). Which is more informative, the pairwise or the multiple alignment?
3) Consider the sequences that you added and that were not retrieved by the BLAST search. Do they
belong in the multiple alignment? Why were they not detected by the BLAST search?
4) Change the parameters of the multiple alignment (e.g. a different, lower gap cost). Do you still find
the right alignment? How sensitive is finding the true alignment to the gap cost?
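The effect of the gap cost can be explored on a small scale with a pairwise global alignment. The sketch below implements the Needleman-Wunsch algorithm with a simple match/mismatch score and a linear gap cost. It is a simplification: ClustalW aligns many sequences at once, uses substitution matrices, and distinguishes gap opening from gap extension. The function name and scoring values are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap cost.

    Returns (score, aligned_a, aligned_b) for one optimal alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap             # prefix of a aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap             # prefix of b aligned to gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))

# Two short fragments taken from the sequences above; vary the gap cost.
for gap_cost in (-1, -2, -5):
    print(gap_cost, global_align("GFGNYFMPLHIGAPDMAFPR", "GFGNWLVPLMIGAPDMAFPR", gap=gap_cost))
```

Running the loop with a range of gap costs shows at which point the algorithm starts to prefer mismatches over gaps, which is the sensitivity that question 4 asks about.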
Results
The system that will be studied is the terminal oxidase, the enzyme that catalyses the final reduction
of O2 to H2O in the respiratory chain to generate energy. For more information see
http://www.sanger.ac.uk/Software/Pfam/
The genes whose sequences were aligned encode subunit 1 of the terminal oxidase
complex. This subunit contains the catalytic site where O2 is reduced to H2O. For this
purpose it contains a heme-copper center. The subunit binds a conserved low-spin heme and a
conserved high-spin heme; the high-spin heme and the Cu center together form the heme-copper
center. Conserved H (histidine) residues have been shown to be involved in binding the hemes. Three
major families of terminal oxidases can be distinguished, 2 of which occur in prokaryotes only
(cbb3-type oxidases and quinol oxidases). The terminal oxidases are all part of the respiratory chain:
they receive electrons from an electron donor and use these electrons to reduce O2. Some terminal
oxidases receive the electrons from quinols, others from cytochrome c. This explains the differences
in sequence between the distinct classes of oxidases.
Eukaryotes have only one type of oxidase, the cytochrome c type oxidase.
Prokaryotes often have branched respiratory chains with different types of terminal oxidases. Each of
these oxidases has different properties (some are produced at very low O2 concentrations and have
a high affinity for O2). This allows bacteria to live in very different environments, while
eukaryotes require a fixed O2 concentration.
The cytochrome c type oxidase is the one that is also present in eukaryotes.
For comparison, the correct alignments with the structurally important sites indicated:
[Figure: multiple alignment, standard gap cost. Legend: ligands of the Cu center; ligands of the hemes]