Sequence alignment and sequence databank search exercises
Biomolecular sequence alignment and sequence databank search
by using publicly available Web tools
Help and tutorial Web pages for all the tools mentioned below are available at each
tool's Web site. Please go through these tutorials before starting the following exercises.
I. Needleman-Wunsch
Global alignment of two sequences by using the following implementations:
• Needle - EMBOSS Pairwise Alignment (European Bioinformatics Institute)
(http://www.ebi.ac.uk/Tools/emboss/align/)
• Needle - at Pasteur Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=needle)
II. Smith-Waterman
Local alignment of two sequences by using the following implementations:
• MPsrch (http://www.ebi.ac.uk/MPsrch/)
• Scanps2.3 (http://www.ebi.ac.uk/scanps/)
• SSEARCH - Protein Similarity Search (http://www.ebi.ac.uk/Tools/fasta33/)
• Water - at Pasteur Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=water)
III. FASTA
Sequence similarity searching by using the implementations available:
• FASTA - at the FASTA Server of the University of Virginia
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_class.shtml)
• FASTA - at EBI (http://www.ebi.ac.uk/Tools/fasta/)
IV. BLAST
Sequence local similarity searching by using the implementations available:
• BLAST - at EBI (http://www.ebi.ac.uk/Tools/blast2/)
Marco Masseroli, PhD
• BLAST - at NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi)
• BLAST - at University of Virginia
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=blast)
• Blast2Seq (http://blast.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi)
V. ClustalW
Multiple alignments of several sequences by using the following implementations:
• ClustalW - at EBI (http://www.ebi.ac.uk/Tools/clustalw2/)
• ClustalW - at EMBNET (http://www.ch.embnet.org/software/ClustalW.html)
Align some of the homologous sequences found with the above algorithm implementations in
order to find conserved regions within them.
VI. GENSCAN
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• GENSCAN at Institut Pasteur (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=genscan)
Practical:
1. Given the following two sequences
Sequence 1: MTPARGSALS
Sequence 2: MTPVRRSLS
use the grid method and the BLOSUM62 substitution matrix to calculate by hand the
optimal global alignment with a linear gap penalty of -1, and the optimal local alignment
with a linear gap penalty of -2 and a threshold of 28.
o Verify the obtained global alignment and its score by using the EBI EMBOSS Needle
tool (http://www.ebi.ac.uk/Tools/emboss/align/) or the Needle tool at Pasteur Institute
(http://mobyle.pasteur.fr/cgi-bin/portal.py?form=needle) that implement the
Needleman-Wunsch global alignment algorithm.
• Does the result change when using BLOSUM40 as the substitution matrix?
• Does the result change when using a different gap penalty (e.g. -2, -3, -5)?
• See what happens to the result when using BLOSUM62 as the substitution matrix
with the same alternative gap penalties (e.g. -2, -3, -5).
o Verify the obtained local alignment and its score by using the Water tool at Pasteur
Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=water) that implements the
Smith-Waterman local alignment algorithm.
• Does the result change when using PAM120 as the substitution matrix?
• Does the result change when using a different gap penalty (e.g. -1, -3, -5)?
• See what happens to the result when using BLOSUM62 as the substitution matrix
with the same alternative gap penalties (e.g. -1, -3, -5).
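The grid method used in exercise 1 can be sketched in Python for both the global and the local case. This is only a minimal illustration, not the EMBOSS Needle or Water implementation: the names `alignment_score` and `toy_score` are made up for this example, and `toy_score` (+1 match, -1 mismatch) is a stand-in assumption that should be replaced by a BLOSUM62 lookup to reproduce the hand calculation. The threshold mentioned in the exercise is not modelled.

```python
def toy_score(x, y):
    """Toy substitution score (assumption): +1 for a match, -1 otherwise."""
    return 1 if x == y else -1


def alignment_score(a, b, score, gap, local=False):
    """Fill the dynamic-programming grid and return the optimal score.

    gap is the (negative) linear penalty per gap position.
    local=False gives a Needleman-Wunsch style global score,
    local=True a Smith-Waterman style local score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    grid = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised in the first row/column.
        for i in range(1, rows):
            grid[i][0] = i * gap
        for j in range(1, cols):
            grid[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(
                grid[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # (mis)match
                grid[i - 1][j] + gap,  # gap in b
                grid[i][j - 1] + gap,  # gap in a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            grid[i][j] = cell
            best = max(best, cell)
    return best if local else grid[-1][-1]


seq1, seq2 = "MTPARGSALS", "MTPVRRSLS"
print("global:", alignment_score(seq1, seq2, toy_score, gap=-1))
print("local: ", alignment_score(seq1, seq2, toy_score, gap=-2, local=True))
```

Because the scores printed here come from the toy scoring scheme, they are not expected to match the BLOSUM62 scores reported by the Web tools.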
2. By using the implementations available at the FASTA Server of the University of
Virginia (http://fasta.bioch.virginia.edu/fasta_www2/fasta_class.shtml) perform the
following set of exercises that illustrate important principles in sequence similarity
searching: I) the effect of scoring matrices, gap penalties and algorithms (similarity in the
results produced by FASTA, BLAST, GGSEARCH (Needleman-Wunsch) and
SSEARCH (Smith-Waterman)); II) detection of local duplications from significant
similarity.
Most of the searches in these exercises should be done against a small protein database,
e.g. the PIR1 database available at the FASTA web site of the University of Virginia.
Searching a small database makes it practical to consider each of the high scoring
similarities, and to evaluate further whether they are likely to be biologically meaningful.
I. Effects of scoring matrices, gap penalties and algorithms
o Compare Drosophila glutathione transferase - GSTT1_DROME (gi|121694)
(http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=121694&
dopt=fasta) to the PIR1 Annotated protein sequence database using the FASTA
protein:protein program with its default parameters at the FASTA search page
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=fa&
query=121694&db=p)
• Look at the E-values, % identities and % similarities: how do they vary, and how
are they correlated?
• Examine how the expectation value changes with different scoring matrices
(BLOSUM62, BlastP62, PAM250) and different gap penalties. (The default
scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of
-10 to open a gap and -2 for each residue in the gap, e.g. -12 for a one-residue
gap).
• How does the distribution of scores and the E-values of the highest scoring
sequences vary when the gap-open/gap-ext penalties are smaller (e.g. -7/1)?
• Perform the same search with the ggsearch program (Needleman-Wunsch)
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=
gnw&query=gstt1_drome&db=p) and ssearch program (Smith-Waterman)
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=s
w&query=gstt1_drome&db=p) with their default parameters.
- Look at the best similar sequences found and their E-values; how do they
vary with respect to those from the FASTA algorithm?
• Search with ktup=1 (What is ktup?), look at the results and compare them to
the results obtained with ktup=2; how do the search results and speed change?
o Do the same search above (Drosophila glutathione transferase - GSTT1_DROME
(gi|121694)) using the BLAST web page at the University of Virginia
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=blast).
• How do the blastp E-values compare with the FASTA (blosum62) E-values?
o Perform analyses similar to the above ones by using the FASTA implementation at
http://www.ebi.ac.uk/Tools/fasta/.
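Two quantities from the searches above can be sketched in a few lines: the gap cost with separate open and extension penalties (-10 and -2, hence -12 for a one-residue gap) and the ktup word length. The code below is an illustrative simplification, not code from the FASTA package; the function names are hypothetical, and `diagonal_hits` only counts identical words of length ktup per diagonal, which roughly corresponds to the first stage of a FASTA-style search.

```python
from collections import Counter, defaultdict


def gap_cost(length, gap_open=-10, gap_ext=-2):
    """Gap cost as described above: open once, then extend per residue."""
    return gap_open + gap_ext * length if length > 0 else 0


def diagonal_hits(query, subject, ktup=2):
    """Count identical ktup-length words shared by query and subject,
    grouped by diagonal (subject position minus query position)."""
    words = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        words[query[i:i + ktup]].append(i)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in words.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


print(gap_cost(1), gap_cost(3))
print(diagonal_hits("ABCDE", "ABXDE", ktup=2))
print(diagonal_hits("ABCDE", "ABXDE", ktup=1))
```

In this toy case the shorter word length produces more word hits on the same diagonal, which gives a feel for why changing ktup affects both the amount of work and what the search can pick up.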
II. Significant similarities within sequences (e.g. domain duplication); exploring
domains with local alignments
o Examine local similarities between human calmodulin CALM_HUMAN and itself
using the lalign program with its default parameters
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=lal&
query=calm_human&q_type=acc&query2=calm_human&q2_type=acc).
o Plot the same alignment using the plalign program with its default parameters
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=pal
&query=calm_human&q_type=acc&query2=calm_human&q2_type=acc).
• How many repeats are present in this sequence?
o What happens to the domain alignment plot when you use a shallower
scoring matrix (e.g. MD20)?
o Examine local similarities between human Death Associated Protein Kinase 1
(DAPK1) DAPK1_HUMAN and itself, using the lalign program with its default parameters
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=lal&
query=dapk1_human&q_type=acc&query2=dapk1_human&q2_type=acc).
o Plot the same alignment using the plalign program with its default parameters
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=pal
&query=dapk1_human&q_type=acc&query2=dapk1_human&q2_type=acc).
• How many repeats are present in this sequence? Try zooming in by plotting the
alignment for the subset of the sequence from residue 350 to 650
(http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=
pal&query=dapk1_human&q_type=acc&ssr=350-650&query2=dapk1_human&q2_type=acc&ssr2=350-650).
o What happens to the domain alignment plot when you use a shallower
scoring matrix (e.g. MD20)?
o Look at the PFAM annotation of this protein at: DAPK1_HUMAN Pfam
(http://pfam.janelia.org/protein?entry=p53355).
• How many repeats are present? Of which domain?
o For more complex domain alignments, try mouse RNA polymerase (rpb1_mouse,
residues 1500-) against itself, using the BLOSUM50 as well as the MD20 scoring
matrices.
• Check the result by looking at the PFAM annotation of this protein
(http://pfam.janelia.org//protein/rpb1_mouse). How many repeats are present
in the searched region? Of which domain?
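As a rough intuition for why internal repeats appear as off-diagonal lines when a sequence is compared with itself, the sketch below counts identical residues between a sequence and a shifted copy of itself. It is a simplified stand-in, not the lalign or plalign algorithm (which work with scored local alignments); the function name and the toy sequence are invented for illustration.

```python
def self_match_counts(seq, min_offset=1):
    """For each offset d, count positions i where seq[i] == seq[i + d]."""
    counts = {}
    for d in range(min_offset, len(seq)):
        counts[d] = sum(1 for i in range(len(seq) - d) if seq[i] == seq[i + d])
    return counts


# Toy sequence (assumption): one 8-residue unit repeated three times.
toy = "ABCDEFGH" * 3
counts = self_match_counts(toy)
best = max(counts, key=counts.get)
print(best, counts[best])  # offset with the most identities and its count
```

Offsets with many identities correspond to the spacing between repeat copies; real protein repeats are diverged, which is why the exercises rely on scored alignments and substitution matrices rather than exact identity.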
3. With the FASTA protein sequence below perform a (protein) BLAST search
(http://blast.ncbi.nlm.nih.gov/Blast.cgi) in the NCBI Non-redundant protein sequences
(nr) databank using the blastp program with its default parameters.
>query sequence
MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGC
LSPSCLSLLHLQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIAD
MLLGFLVMPVSMLTILYGYRWPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNP
IHHSRFNSRTKAFLKIIAVWTISVGISMPIPVFGLQDDSKVFKEGSCLLADDNFVLIGSF
VSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLASFSFLPQSSLSSEKLFQRSIH
REPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNEDVIGA
LLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYK
SSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV
o How many similar sequences are found?
o What information can be extracted from the graph in the result page?
o What are the most similar sequences found?
o What is the query sequence? On what basis do you identify it?
o See what happens if you perform the same BLAST search using as query sequence
only a fragment of the previous query sequence.
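To prepare a fragment of the query for the last question, the FASTA record can be read and sliced programmatically. The following is a minimal sketch with a hypothetical helper `parse_fasta`; only the first two lines of the query sequence are included here, and the choice of the first 60 residues as the fragment is arbitrary.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated copy of the query above (first two sequence lines only).
example = """>query sequence
MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGC
LSPSCLSLLHLQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIAD
"""
header, seq = parse_fasta(example)[0]
fragment = seq[:60]  # arbitrary fragment: the first 60 residues
print(f">{header} fragment")
print(fragment)
```

The printed record can be pasted into the BLAST query box in place of the full sequence.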
4. Perform (nucleotide) BLAST searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi) of the
nucleotide sequence with accession number AB037513 using blastn and tblastx programs
against the Reference genomic sequences (refseq_genomic) database, limiting the
searches to the organism Drosophila melanogaster (tax_id: 7227) and with their other
default parameters.
o What does the sequence with accession number AB037513 encode?
o How many similar sequences are found in each case (blastn or tblastx searches)?
o What can be stated regarding the quality of hits?
o What are the differences between the two programs blastn and tblastx, and how do the
respective search results originate?
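Since tblastx works on six-frame translations of nucleotide sequences, it can help to see what such a translation looks like. The sketch below is an illustration only, not BLAST code: the function names are invented, the standard genetic code is assumed, and the input is a toy nucleotide string rather than AB037513.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the standard code is appropriate for the sequence at hand).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward and three reverse translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```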
5. Use the BLAST implementations from EBI (http://www.ebi.ac.uk/Tools/blast2/) and
NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi), with their default parameters, to search for
the two FASTA test sequences below.
>test sequence 1
CGCGGCCGCGGTTCGCTGTGGCGGGCGCCTGGGCCGCCGGCTGTTTAACTTCGCTTCCGC
TGGCCCATAGTGATCTTTGCAGTGACCCAGCAGCATCACTGTTTCTTGGCGTGTGAAGAT
AACCCAAGGAATTGAGGAAGTTGCTGAGAAGAGTGTGCTGGAGATGCTCTAGGAAAAAAT
TGAATAGTGAGACGAGTTCCAGCGCAAGGGTTTCTGGTTTGCCAAGAAGAAAGTGAACAT
CATGGATCAGAACAACAGCCTGCCACCTTACGCTCAGGGCTTGGCCTCCCCTCAGGGTGC
CATGACTCCCGGAATCCCTATCTTTAGTCCAATGATGCCTTATGGCACTGGACTGACCCC
ACAGCCTATTCAGAACACCAATAGTCTGTCTATTTTGGAAGAGCAACAAAGGCAGCAGCA
GCAACAACAACAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAACAGCAGCAGCAGCA
GCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGGCAGTGGCAGCTGC
AGCCGTTCAGCAGTCAACGTCCCAGCAGGCAACACAGGGAACCTCAGGCCAGGCACCACA
GCTCTTCCACTCACAGACTCTCACAACTGCACCCTTGCCGGGCACCACTCCACTGTATCC
CTCCCCCATGACTCCCATGACCCCCATCACTCCTGCCACGCCAGCTTCGGAGAGTTCTGG
GATTGTACCGCAGCTGCAAAATATTGTATCCACAGTGAATCTTGGTTGTAAACTTGACCT
AAAGACCATTGCACTTCGTGCCCGAAACGCCGAATATAATCCCAAGCGGTTTGCTGCGGT
AATCATGAGGATAAGAGAGCCACGAACCACGGCACTGATTTTCAGTTCTGGGAAAATGGT
GTGCACAGGAGCCAAGAGTGAAGAACAGTCCAGACTGGCAGCAAGAAAATATGCTAGAGT
TGTACAGAAGTTGGGTTTTCCAGCTAAGTTCTTGGACTTCAAGATTCAGAACATGGTGGG
GAGCTGTGATGTGAAGTTTCCTATAAGGTTAGAAGGCCTTGTGCTCACCCACCAACAATT
TAGTAGTTATGAGCCAGAGTTATTTCCTGGTTTAATCTACAGAATGATCAAACCCAGAAT
TGTTCTCCTTATTTTTGTTTCTGGAAAAGTTGTATTAACAGGTGCTAAAGTCAGAGCAGA
AATTTATGAAGCATTTGAAAACATCTACCCTATTCTAAAGGGATTCAGGAAGACGACGTA
ATGGCTCTCATGTACCCTTGCCTCCCCCACCCCCTTCTTTTTTTTTTTTTAAACAAATCA
GTTTGTTTTGGTACCTTTAAATGGTGGTGTTGTGAGAAGATGGATGTTGAGTTGCAGGGT
GTGGCACCAGGTGATGCCCTTCTGTAAGTGCCCACCGCGGGATGCCGGGAAGGGGCATTA
TTTGTGCACTGAGAACACCGCGCAGCGTGACTGTGAGTTGCTCATACCGTGCTGCTATCT
GGGCAGCGCTGCCCATTTATTTATATGTAGATTTTAAACACTGCTGTTGACAAGTTGGTT
TGAGGGAGAAAACTTTAAGTGTTAAAGCCACCTCTATAATTGATTGGACTTTTTAATTTT
AATGTTTTTCCCCATGAACCACAGTTTTTATATTTCTACCAGAAAAGTAAAAATCTTTTT
TAAAAGTGTTGTTTTTCTAATTTATAACTCCTAGGGGTTATTTCTGTGCCAGACACATTC
CACCTCTCCAGTATTGCAGGACGGAATATATGTGTTAATGAAAATGAATGGCTGTACATA
TTTTTTTCTTTCTTCAGAGTACTCTGTACAATAAATGCAGTTTATAAAAGTGTTAAAAAA
AAAAAAAAAAAAAAAA
>test sequence 2
MDQNNSLPPYAQGLASPQGAMTPGIPIFSPMMPYGTGLTPQPIQNTNSLSILEEQQRQQQ
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQAVAAAAVQQSTSQQATQGTSGQAPQ
LFHSQTLTTAPLPGTTPLYPSPMTPMTPITPATPASESSGIVPQLQNIVSTVNLGCKLDL
KTIALRARNAEYNPKRFAAVIMRIREPRTTALIFSSGKMVCTGAKSEEQSRLAARKYARV
VQKLGFPAKFLDFKIQNMVGSCDVKFPIRLEGLVLTHQQFSSYEPELFPGLIYRMIKPRI
VLLIFVSGKVVLTGAKVRAEIYEAFENIYPILKGFRKTT
o What are the two test sequences?
o Which are their homologous (paralogous and orthologous) sequences?
6. Perform a blast2sequences alignment (http://blast.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi)
between proteins P28223 and Q24511 by using the blastp program and its default
parameters.
o What are the two proteins?
o How can the result be interpreted?
7. Perform a multiple alignment of the three sequences below with CLUSTALW
(http://www.ebi.ac.uk/Tools/clustalw2/) and with its default parameters.
In the result Web page, at the end of the Alignment section, click the “Show Colors”
button to display color coded properties of the amino acids in the sequences.
>gi|543727|sp|P28223.2|5HT2A_HUMAN RecName: Full=5-hydroxytryptamine
receptor 2A; Short=5-HT-2A; Short=5-HT-2; AltName: Full=Serotonin
receptor 2A
MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGCLSPSCLSLLH
LQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYR
WPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGISMPI
PVFGLQDDSKVFKEGSCLLADDNFVLIGSFVSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLAS
FSFLPQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICK
ESCNEDVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYK
SSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV
>gi|7296517|gb|AAF51802.1| CG7485-PA [Drosophila melanogaster]
MPSADQILFVNVTTTVAAAALTAAAAVSTTKSGSGNAARGYTDSDDDAGMGTEAVANISGSLVEGLTTVT
AALSTAQADKDSAGECEGAVEELHASILGLQLAVPEWEALLTALVLSVIIVLTIIGNILVILSVFTYKPL
RIVQNFFIVSLAVADLTVALLVLPFNVAYSILGRWEFGIHLCKLWLTCDVLCCTSSILNLCAIALDRYWA
ITDPINYAQKRTVGRVLLLISGVWLLSLLISSPPLIGWNDWPDEFTSATPCELTSQRGYVIYSSLGSFFI
PLAIMTIVYIEIFVATRRRLRERARANKLNTIALKSTELEPMANSSPVAASNSGSKSRLLASWLCCGRDR
AQFATPMIQNDQESISSETHQPQDSSKAGPHGNSDPQQQHVVVLVKKSRRAKTKDSIKHGKTRGGRKSQS
SSTCEPHGEQQLLPAGGDGGSCQPGGGHSGGGKSDAEISTESGSDPKGCIQVCVTQADEQTSLKLTPPQS
STGVAAVSVTPLQKKTSGVNQFIEEKQKISLSKERRAARTLGIIMGVFVICWLPFFLMYVILPFCQTCCP
TNKFKNFITWLGYINSGLNPVIYTIFNLDYRRAFKRLLGLN
>gi|10726392|gb|AAF54255.2| CG8007-PA, isoform A [Drosophila
melanogaster]
MKNRFFFSFTGYFPLGSEHCLTWICLDVLFCTASIMHLCTISVDRYLSLRYPMRFGRNKTRRRVTLKIVF
VWLLSIAMSLPLSLMYSKNHASVLVNGTCQIPDPVYKLVGSIVCFYIPLGVMLLTYCLTVRLLARQRQNL
GGGQQTAAATPGWASGWLGQAPALGGPAGAAPATVGSTLSPHSAHNQGSSLTVQSDGSGYLAAPGTPCPG
RRKLSISKTASVVTWDSSRHRRRGSSFGGVRTSLLLTPTKTATTSNSSTPLRRSATLRSHQNMNYQGAGE
CGGKTRTTTSSPCMLQRQQTVRSHHSRNSSVISRNSSRHGRIIRLEQKATKVLGVVFFTFVILWSPFFVL
NLLPTVCAECEERISHWVFDVVTWLGYASSMVNPIFYTIFNKVFRQAFKKVLLCRYSSTSAWRPSR
o How can the result be interpreted?
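When reading the ClustalW output, fully conserved columns are the ones marked with `*` under the alignment. A minimal sketch of how such a marker line can be derived from an already aligned set of sequences is given below; the function name and the toy alignment are assumptions for illustration, and only the `*` case is reproduced (not the `:` and `.` markers for similar residues).

```python
def conserved_columns(aligned):
    """Return a string with '*' under columns where all sequences share the
    same residue (and none has a gap), and ' ' elsewhere."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*aligned):
        same = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if same else " ")
    return "".join(marks)


# Toy alignment (made up); '-' denotes a gap.
toy_alignment = ["MK-LV", "MKALV", "MR-LV"]
for row in toy_alignment:
    print(row)
print(conserved_columns(toy_alignment))
```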
8. Perform a multiple alignment of the six sequences below gi|19424144, gi|21245114,
gi|2499874, gi|4503155, gi|1705638, and gi|15214962 with CLUSTALW
(http://www.ebi.ac.uk/Tools/clustalw2/) and with its default parameters.
In the result Web page, at the end of the Alignment section, click the “Show Colors”
button to display color coded properties of the amino acids in the sequences.
Then, calculate a phylogenetic tree for the six proteins by clicking the “Show as
Phylogram Tree” button at the beginning of the Guide Tree section in the result Web
page.
>gi|19424144|ref|NP_081182.2| cathepsin 3 precursor [Mus musculus]
MTPAVFLAILWLVMASSSPSPDPILDAEWQKWKIKYGKTYSLEEEGQKRAVWEENMKKIKLHNGENGLGK
HGFTMEMNAFGDMTLEEFRKEMIEIPVPTVKKGKSVQKRLSVNLPKFINWKKRGYVTPVRTQIACNSCWA
ISVTGAIEGQMFRKTGQLIPLSVQNLVDCVDGSGCHAGSVLDSFKYLMEKGGLESEATYPYEDKQGSCRY
NPENSTASITGFEFIPNNEVDLMSAVASLGPISVVIDAWHESFLFYKRGIYYEPNCNNSLFALRHAVLLV
GYGFIGRESEGRKYWIIKNSLGTKWGYKGYMKIAKDQGNHCGIASLPVFPRV
>gi|21245114|ref|NP_640355.1| cathepsin Q [Rattus norvegicus]
MTPAVFLVILCLGVVPGASALDLSLDVQWQEWKIKYEKLYSPEEEVLKRVVWEENVKKIELHNRENSLGK
NTYTMEINDFADMTDEEFKDMIIGFQLPVHNTEKRLWKRALGSFFPNSWNWRDALPKFVDWRNEGYVTRV
RKQGGCSSCWAFPVTGAIEGQMFKKTGKLIPLSVQNLIDCSKPQGNRGCLWGNTYNAFQYVLHNGGLEAE
ATYPYERKEGVCRYNPKNSSAKITGFVVLPESEDVLMDAVATKGPIATGVHVISSSFRFYQKGVYHEPKC
SSYVNHAVLVVGYGFEGNETDGNNYWLIKNSWGKRWGLRGYMKIAKDRNNHCAIASLAQYPTV
>gi|2499874|sp|Q28944.1|CATL1_PIG RecName: Full=Cathepsin L1; Contains:
RecName: Full=Cathepsin L1 heavy chain; Contains: RecName:
Full=Cathepsin L1 light chain; Flags: Precursor
MKPSLFLTALCLGIASAAPKLDQNLDADWYKWKATHGRLYGMNEEGWRRAVWEKNMKMIELHNQEYSQGK
HGFSMAMNAFGDMTNEEFRQVMNGFQNQKHKKGKVFHESLVLEVPKSVDWREKGYVTAVKNQGQCGSCWA
FSATGALEGQMFRKTGKLVSLSEQNLVDCSRPQGNQGCNGGLMDNAFQYVKDNGGLDTEESYPYLGRETN
SCTYKPECSAANDTGFVDIPQREKALMKAVATVGPISVAIDAGHSSFQFYKSGIYYDPDCSSKDLDHGVL
VVGYGFEGTDSNSSKFWIVKNSWGPEWGWNGYVKMAKDQNNHCGISTAASYPTV
>gi|4503155|ref|NP_001903.1| cathepsin L1 preproprotein [Homo sapiens]
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNMKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV
VGYGFESTESDNNKYWLVKNSWGEEWGMGGYVKMAKDRRNHCGIASAASYPTV
>gi|1705638|sp|P25975|CATL_BOVIN Cathepsin L precursor [Contains:
Cathepsin L heavy chain; Cathepsin L light chain]
MNPSFFLTVLCLGVASAAPKLDPNLDAHWHQWKATHRRLYGMNEEEWRRAVWEKNKKIIDLHNQEYSEGK
HAFRMAMNAFGDMTNEEFRQVMNGFQNQKHKKGKLFHEPLLVDVPKSVDWTKKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGKLVSLSEQNLVDCSRAQGNQGCNGGLMDNAFQYIKDNGGLDSEESYPYLATDTN
SCNYKPECSAANDTGFVDIPQREKALMKAVATVGPISVAIDAGHTSFQFYKSGIYYDPDCSCKDLDHGVL
VVGYGFEGTDSNNNKFWIVKNSWGPEWGWNGYVKMAKDQNNHCGIATAASYPTV
>gi|15214962|gb|AAH12612.1| Cathepsin L1 [Homo sapiens]
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNVKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV
VGYGFESTESDNNKYWLVKNSWGEEWGMGGYVKMAKDRRNHCGIASAASYPTV
o How can the result be interpreted? To what kind of proteins do the sequences belong?
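The guide tree shown by ClustalW is derived from pairwise distances between the sequences. As a rough sketch of that idea, the code below computes a simple p-distance (fraction of differing aligned positions) for each pair. The function names and the toy alignment are assumptions for illustration, and the distance used here is not necessarily the one ClustalW computes.

```python
def p_distance(a, b):
    """Fraction of differing positions, skipping columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}


# Toy alignment (made up); '-' denotes a gap.
toy = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "AGGTTC"}
for (m, n), d in distance_matrix(toy).items():
    if m < n:
        print(m, n, round(d, 3))
```

Pairs with the smallest distances are the ones expected to be joined first in a tree, which is one way to relate the phylogram to the alignment itself.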
9. With GENSCAN (http://genes.mit.edu/GENSCAN.html or
http://mobyle.pasteur.fr/cgi-bin/portal.py?form=genscan), analyze the entry AC012088
(http://www.bioinformatics.polimi.it/masseroli/BBCMM/dispense/esercitazioni/E1_SequenceAlignment_Es9_AC012088_sequences.fasta.txt).
Look at the predicted genes/exons and peptide sequence(s), and display the graphic
“PDF image of the predicted gene(s)”, which shows the position of the predicted coding
nucleotide sequences along the query sequence.
o How can the result be interpreted?