Sequence alignment and sequence databank search exercises
Marco Masseroli, PhD

Biomolecular sequence alignment and sequence databank search by using publicly available Web tools

Help and tutorial Web pages for all the tools mentioned below are available at each tool's Web site. Please go through these tutorials before carrying out the following exercises.

I. Needleman-Wunsch
Global alignment of two sequences by using the following implementations:
• Needle - EMBOSS Pairwise Alignment (European Bioinformatics Institute) (http://www.ebi.ac.uk/Tools/emboss/align/)
• Needle - at Pasteur Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=needle)

II. Smith-Waterman
Local alignment of two sequences by using the following implementations:
• MPsrch (http://www.ebi.ac.uk/MPsrch/)
• Scanps2.3 (http://www.ebi.ac.uk/scanps/)
• SSEARCH - Protein Similarity Search (http://www.ebi.ac.uk/Tools/fasta33/)
• Water - at Pasteur Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=water)

III. FASTA
Sequence similarity searching by using the implementations available:
• FASTA - at the FASTA Server of the University of Virginia (http://fasta.bioch.virginia.edu/fasta_www2/fasta_class.shtml)
• FASTA - at EBI (http://www.ebi.ac.uk/Tools/fasta/)

IV. BLAST
Sequence local similarity searching by using the implementations available:
• BLAST - at EBI (http://www.ebi.ac.uk/Tools/blast2/)
• BLAST - at NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi)
• BLAST - at University of Virginia (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=blast)
• Blast2Seq (http://blast.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi)

V. ClustalW
Multiple alignments of several sequences by using the following implementations:
• ClustalW - at EBI (http://www.ebi.ac.uk/Tools/clustalw2/)
• ClustalW - at EMBNET (http://www.ch.embnet.org/software/ClustalW.html)
Align some of the homologous sequences found by using the above algorithm implementations in order to find conserved regions within them.

VI. GENSCAN
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• GENSCAN at Institut Pasteur (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=genscan)

Practical:

1. For the following two sequences
   Sequence 1: MTPARGSALS
   Sequence 2: MTPVRRSLS
   calculate by hand, using the grid method and the BLOSUM62 substitution matrix, the optimal global alignment with a linear gap penalty of -1, and the optimal local alignment with a linear gap penalty of -2 and a threshold of 28.
   o Verify the obtained global alignment and its score by using the EBI EMBOSS Needle tool (http://www.ebi.ac.uk/Tools/emboss/align/) or the Needle tool at Pasteur Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=needle), which implement the Needleman-Wunsch global alignment algorithm.
      • Does the result change by using BLOSUM40 as the substitution matrix?
      • Does the result change by using a different gap penalty (e.g. -2, -3, -5)?
      • See what happens to the result by using BLOSUM62 as the substitution matrix and the same different gap penalties (e.g. -2, -3, -5).
   o Verify the obtained local alignment and its score by using the Water tool at Pasteur Institute (http://mobyle.pasteur.fr/cgi-bin/portal.py?form=water), which implements the Smith-Waterman local alignment algorithm.
      • Does the result change by using PAM120 as the substitution matrix?
      • Does the result change by using a different gap penalty (e.g. -1, -3, -5)?
      • See what happens to the result by using BLOSUM62 as the substitution matrix and the same different gap penalties (e.g. -1, -3, -5).

2. Using the implementations available at the FASTA Server of the University of Virginia (http://fasta.bioch.virginia.edu/fasta_www2/fasta_class.shtml), perform the following set of exercises, which illustrate important principles in sequence similarity searching:
   I) the effect of scoring matrices, gap penalties and algorithms (similarity in the results produced by FASTA, BLAST, GGSEARCH (Needleman-Wunsch) and SSEARCH (Smith-Waterman));
   II) detection of local duplications from significant similarity.
   Most of the searches in these exercises should be done against a small protein database, e.g. the PIR1 database available at the FASTA web site of the University of Virginia. Searching a small database makes it practical to consider each of the high-scoring similarities, and to evaluate further whether they are likely to be biologically meaningful.

   I. Effects of scoring matrices, gap penalties and algorithms
   o Compare Drosophila glutathione transferase - GSTT1_DROME (gi|121694) (http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&list_uids=121694&dopt=fasta) to the PIR1 Annotated protein sequence database using the FASTA protein:protein program with its default parameters at the FASTA search page (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=fa&query=121694&db=p)
      • Look at E-values, % identities and % similarities; how do they vary, and how are they correlated?
      • Examine how the expectation value changes with different scoring matrices (BLOSUM62, BlastP62, PAM250) and different gap penalties. (The default scoring matrix for the FASTA programs is BLOSUM50, with gap penalties of -10 to open a gap and -2 for each residue in the gap, e.g. -12 for a one-residue gap.)
      • How do the distribution of scores and the E-values of the highest scoring sequences vary when the gap-open/gap-ext penalties are smaller (e.g. -7/-1)?
      • Perform the same search with the ggsearch program (Needleman-Wunsch) (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=gnw&query=gstt1_drome&db=p) and the ssearch program (Smith-Waterman) (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=select&pgm=sw&query=gstt1_drome&db=p) with their default parameters.
         - Look at the most similar sequences found and their E-values; how do they vary with respect to those from the FASTA algorithm?
      • Search with ktup=1 (What is ktup?), look at the results and compare them to the results obtained with ktup=2; how do the search results and speed change?
   o Do the same search as above (Drosophila glutathione transferase - GSTT1_DROME (gi|121694)) using the BLAST web page at the University of Virginia (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=blast).
      • How do the blastp E-values compare with the FASTA (blosum62) E-values?
   o Perform analyses similar to the above ones by using the FASTA implementation at http://www.ebi.ac.uk/Tools/fasta/.

   II. Significant similarities within sequences (e.g. domain duplication); exploring domains with local alignments
   o Examine local similarities between human calmodulin CALM_HUMAN and itself using the lalign program with its default parameters (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=lal&query=calm_human&q_type=acc&query2=calm_human&q2_type=acc).
   o Plot the same alignment using the plalign program with its default parameters (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=pal&query=calm_human&q_type=acc&query2=calm_human&q2_type=acc).
      • How many repeats are present in this sequence?
   o What happens to the domain alignment plot when you use a shallower scoring matrix (e.g. MD20)?
   o Examine local similarities between human Death Associated Protein Kinase 1 (DAPK1) DAPK1_HUMAN and itself, using the lalign program with its default parameters (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=lal&query=dapk1_human&q_type=acc&query2=dapk1_human&q2_type=acc).
   o Plot the same alignment using the plalign program with its default parameters (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=pal&query=dapk1_human&q_type=acc&query2=dapk1_human&q2_type=acc).
      • How many repeats are present in this sequence? Try zooming in by doing the alignment plot using the subset of the sequence from 350-650 (http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign&pgm=pal&query=dapk1_human&q_type=acc&ssr=350-650&query2=dapk1_human&q2_type=acc&ssr2=350-650).
   o What happens to the domain alignment plot when you use a shallower scoring matrix (e.g. MD20)?
   o Look at the PFAM annotation of this protein at: DAPK1_HUMAN Pfam (http://pfam.janelia.org/protein?entry=p53355).
      • How many repeats are present? Of which domain?
   o For more complex domain alignments, try mouse RNA polymerase (rpb1_mouse, residues 1500-) against itself, using the BLOSUM50 as well as the MD20 scoring matrices.
      • Check the result by looking at the PFAM annotation of this protein (http://pfam.janelia.org//protein/rpb1_mouse). How many repeats are present in the searched region? Of which domain?

3. With the FASTA protein sequence below, perform a (protein) BLAST search (http://blast.ncbi.nlm.nih.gov/Blast.cgi) in the NCBI Non-redundant protein sequences (nr) databank using the blastp program with its default parameters.
>query sequence
MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGC
LSPSCLSLLHLQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIAD
MLLGFLVMPVSMLTILYGYRWPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNP
IHHSRFNSRTKAFLKIIAVWTISVGISMPIPVFGLQDDSKVFKEGSCLLADDNFVLIGSF
VSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLASFSFLPQSSLSSEKLFQRSIH
REPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICKESCNEDVIGA
LLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYK
SSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV
   o How many similar sequences are found?
   o What information can be extracted from the graph in the result page?
   o What are the most similar sequences found?
   o What is the query sequence? On what basis do you identify it?
   o See what happens if you perform the same BLAST search using as query sequence only a fragment of the previous query sequence.

4. Perform (nucleotide) BLAST searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi) of the nucleotide sequence with accession number AB037513 using the blastn and tblastx programs against the Reference genomic sequences (refseq_genomic) database, limiting the searches to the organism Drosophila melanogaster (tax_id: 7227) and keeping the other parameters at their defaults.
   o What does the sequence with accession number AB037513 encode?
   o How many similar sequences are found in each case (blastn or tblastx searches)?
   o What can be stated regarding the quality of the hits?
   o What are the differences between the two programs, blastn and tblastx, and how do the respective search results originate?

5. Use the BLAST implementations from EBI (http://www.ebi.ac.uk/Tools/blast2/) and NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi), with their default parameters, to search for the two FASTA test sequences below.
>test sequence 1
CGCGGCCGCGGTTCGCTGTGGCGGGCGCCTGGGCCGCCGGCTGTTTAACTTCGCTTCCGC
TGGCCCATAGTGATCTTTGCAGTGACCCAGCAGCATCACTGTTTCTTGGCGTGTGAAGAT
AACCCAAGGAATTGAGGAAGTTGCTGAGAAGAGTGTGCTGGAGATGCTCTAGGAAAAAAT
TGAATAGTGAGACGAGTTCCAGCGCAAGGGTTTCTGGTTTGCCAAGAAGAAAGTGAACAT
CATGGATCAGAACAACAGCCTGCCACCTTACGCTCAGGGCTTGGCCTCCCCTCAGGGTGC
CATGACTCCCGGAATCCCTATCTTTAGTCCAATGATGCCTTATGGCACTGGACTGACCCC
ACAGCCTATTCAGAACACCAATAGTCTGTCTATTTTGGAAGAGCAACAAAGGCAGCAGCA
GCAACAACAACAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCAACAGCAGCAGCAGCA
GCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGGCAGTGGCAGCTGC
AGCCGTTCAGCAGTCAACGTCCCAGCAGGCAACACAGGGAACCTCAGGCCAGGCACCACA
GCTCTTCCACTCACAGACTCTCACAACTGCACCCTTGCCGGGCACCACTCCACTGTATCC
CTCCCCCATGACTCCCATGACCCCCATCACTCCTGCCACGCCAGCTTCGGAGAGTTCTGG
GATTGTACCGCAGCTGCAAAATATTGTATCCACAGTGAATCTTGGTTGTAAACTTGACCT
AAAGACCATTGCACTTCGTGCCCGAAACGCCGAATATAATCCCAAGCGGTTTGCTGCGGT
AATCATGAGGATAAGAGAGCCACGAACCACGGCACTGATTTTCAGTTCTGGGAAAATGGT
GTGCACAGGAGCCAAGAGTGAAGAACAGTCCAGACTGGCAGCAAGAAAATATGCTAGAGT
TGTACAGAAGTTGGGTTTTCCAGCTAAGTTCTTGGACTTCAAGATTCAGAACATGGTGGG
GAGCTGTGATGTGAAGTTTCCTATAAGGTTAGAAGGCCTTGTGCTCACCCACCAACAATT
TAGTAGTTATGAGCCAGAGTTATTTCCTGGTTTAATCTACAGAATGATCAAACCCAGAAT
TGTTCTCCTTATTTTTGTTTCTGGAAAAGTTGTATTAACAGGTGCTAAAGTCAGAGCAGA
AATTTATGAAGCATTTGAAAACATCTACCCTATTCTAAAGGGATTCAGGAAGACGACGTA
ATGGCTCTCATGTACCCTTGCCTCCCCCACCCCCTTCTTTTTTTTTTTTTAAACAAATCA
GTTTGTTTTGGTACCTTTAAATGGTGGTGTTGTGAGAAGATGGATGTTGAGTTGCAGGGT
GTGGCACCAGGTGATGCCCTTCTGTAAGTGCCCACCGCGGGATGCCGGGAAGGGGCATTA
TTTGTGCACTGAGAACACCGCGCAGCGTGACTGTGAGTTGCTCATACCGTGCTGCTATCT
GGGCAGCGCTGCCCATTTATTTATATGTAGATTTTAAACACTGCTGTTGACAAGTTGGTT
TGAGGGAGAAAACTTTAAGTGTTAAAGCCACCTCTATAATTGATTGGACTTTTTAATTTT
AATGTTTTTCCCCATGAACCACAGTTTTTATATTTCTACCAGAAAAGTAAAAATCTTTTT
TAAAAGTGTTGTTTTTCTAATTTATAACTCCTAGGGGTTATTTCTGTGCCAGACACATTC
CACCTCTCCAGTATTGCAGGACGGAATATATGTGTTAATGAAAATGAATGGCTGTACATA
TTTTTTTCTTTCTTCAGAGTACTCTGTACAATAAATGCAGTTTATAAAAGTGTTAAAAAA
AAAAAAAAAAAAAAAA
>test sequence 2
MDQNNSLPPYAQGLASPQGAMTPGIPIFSPMMPYGTGLTPQPIQNTNSLSILEEQQRQQQ
QQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQAVAAAAVQQSTSQQATQGTSGQAPQ
LFHSQTLTTAPLPGTTPLYPSPMTPMTPITPATPASESSGIVPQLQNIVSTVNLGCKLDL
KTIALRARNAEYNPKRFAAVIMRIREPRTTALIFSSGKMVCTGAKSEEQSRLAARKYARV
VQKLGFPAKFLDFKIQNMVGSCDVKFPIRLEGLVLTHQQFSSYEPELFPGLIYRMIKPRI
VLLIFVSGKVVLTGAKVRAEIYEAFENIYPILKGFRKTT
   o What are the two test sequences?
   o Which are their homologous (paralogous and orthologous) sequences?

6. Perform a blast2sequences alignment (http://blast.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi) between proteins P28223 and Q24511 by using the blastp program and its default parameters.
   o What are the two proteins?
   o How can the result be interpreted?

7. Perform a multiple alignment of the three sequences below with CLUSTALW (http://www.ebi.ac.uk/Tools/clustalw2/) and with its default parameters. In the result Web page, at the end of the Alignment section, click the “Show Colors” button to display color-coded properties of the amino acids in the sequences.
>gi|543727|sp|P28223.2|5HT2A_HUMAN RecName: Full=5-hydroxytryptamine receptor 2A; Short=5-HT-2A; Short=5-HT-2; AltName: Full=Serotonin receptor 2A
MDILCEENTSLSSTTNSLMQLNDDTRLYSNDFNSGEANTSDAFNWTVDSENRTNLSCEGCLSPSCLSLLH
LQEKNWSALLTAVVIILTIAGNILVIMAVSLEKKLQNATNYFLMSLAIADMLLGFLVMPVSMLTILYGYR
WPLPSKLCAVWIYLDVLFSTASIMHLCAISLDRYVAIQNPIHHSRFNSRTKAFLKIIAVWTISVGISMPI
PVFGLQDDSKVFKEGSCLLADDNFVLIGSFVSFFIPLTIMVITYFLTIKSLQKEATLCVSDLGTRAKLAS
FSFLPQSSLSSEKLFQRSIHREPGSYTGRRTMQSISNEQKACKVLGIVFFLFVVMWCPFFITNIMAVICK
ESCNEDVIGALLNVFVWIGYLSSAVNPLVYTLFNKTYRSAFSRYIQCQYKENKKPLQLILVNTIPALAYK
SSQLQMGQKKNSKQDAKTTDNDCSMVALGKQHSEEASKDNSDGVNEKVSCV
>gi|7296517|gb|AAF51802.1| CG7485-PA [Drosophila melanogaster]
MPSADQILFVNVTTTVAAAALTAAAAVSTTKSGSGNAARGYTDSDDDAGMGTEAVANISGSLVEGLTTVT
AALSTAQADKDSAGECEGAVEELHASILGLQLAVPEWEALLTALVLSVIIVLTIIGNILVILSVFTYKPL
RIVQNFFIVSLAVADLTVALLVLPFNVAYSILGRWEFGIHLCKLWLTCDVLCCTSSILNLCAIALDRYWA
ITDPINYAQKRTVGRVLLLISGVWLLSLLISSPPLIGWNDWPDEFTSATPCELTSQRGYVIYSSLGSFFI
PLAIMTIVYIEIFVATRRRLRERARANKLNTIALKSTELEPMANSSPVAASNSGSKSRLLASWLCCGRDR
AQFATPMIQNDQESISSETHQPQDSSKAGPHGNSDPQQQHVVVLVKKSRRAKTKDSIKHGKTRGGRKSQS
SSTCEPHGEQQLLPAGGDGGSCQPGGGHSGGGKSDAEISTESGSDPKGCIQVCVTQADEQTSLKLTPPQS
STGVAAVSVTPLQKKTSGVNQFIEEKQKISLSKERRAARTLGIIMGVFVICWLPFFLMYVILPFCQTCCP
TNKFKNFITWLGYINSGLNPVIYTIFNLDYRRAFKRLLGLN
>gi|10726392|gb|AAF54255.2| CG8007-PA, isoform A [Drosophila melanogaster]
MKNRFFFSFTGYFPLGSEHCLTWICLDVLFCTASIMHLCTISVDRYLSLRYPMRFGRNKTRRRVTLKIVF
VWLLSIAMSLPLSLMYSKNHASVLVNGTCQIPDPVYKLVGSIVCFYIPLGVMLLTYCLTVRLLARQRQNL
GGGQQTAAATPGWASGWLGQAPALGGPAGAAPATVGSTLSPHSAHNQGSSLTVQSDGSGYLAAPGTPCPG
RRKLSISKTASVVTWDSSRHRRRGSSFGGVRTSLLLTPTKTATTSNSSTPLRRSATLRSHQNMNYQGAGE
CGGKTRTTTSSPCMLQRQQTVRSHHSRNSSVISRNSSRHGRIIRLEQKATKVLGVVFFTFVILWSPFFVL
NLLPTVCAECEERISHWVFDVVTWLGYASSMVNPIFYTIFNKVFRQAFKKVLLCRYSSTSAWRPSR
   o How can the result be interpreted?

8. Perform a multiple alignment of the six sequences below gi|19424144, gi|21245114, gi|2499874, gi|4503155, gi|1705638, and gi|15214962 with CLUSTALW (http://www.ebi.ac.uk/Tools/clustalw2/) and with its default parameters. In the result Web page, at the end of the Alignment section, click the “Show Colors” button to display color coded properties of the amino acids in the sequences. Then, calculate a phylogenetic tree for the six proteins by clicking the “Show as Phylogram Tree” button at the beginning of the Guide Tree section in the result Web page.
>gi|19424144|ref|NP_081182.2| cathepsin 3 precursor [Mus musculus]
MTPAVFLAILWLVMASSSPSPDPILDAEWQKWKIKYGKTYSLEEEGQKRAVWEENMKKIKLHNGENGLGK
HGFTMEMNAFGDMTLEEFRKEMIEIPVPTVKKGKSVQKRLSVNLPKFINWKKRGYVTPVRTQIACNSCWA
ISVTGAIEGQMFRKTGQLIPLSVQNLVDCVDGSGCHAGSVLDSFKYLMEKGGLESEATYPYEDKQGSCRY
NPENSTASITGFEFIPNNEVDLMSAVASLGPISVVIDAWHESFLFYKRGIYYEPNCNNSLFALRHAVLLV
GYGFIGRESEGRKYWIIKNSLGTKWGYKGYMKIAKDQGNHCGIASLPVFPRV
>gi|21245114|ref|NP_640355.1| cathepsin Q [Rattus norvegicus]
MTPAVFLVILCLGVVPGASALDLSLDVQWQEWKIKYEKLYSPEEEVLKRVVWEENVKKIELHNRENSLGK
NTYTMEINDFADMTDEEFKDMIIGFQLPVHNTEKRLWKRALGSFFPNSWNWRDALPKFVDWRNEGYVTRV
RKQGGCSSCWAFPVTGAIEGQMFKKTGKLIPLSVQNLIDCSKPQGNRGCLWGNTYNAFQYVLHNGGLEAE
ATYPYERKEGVCRYNPKNSSAKITGFVVLPESEDVLMDAVATKGPIATGVHVISSSFRFYQKGVYHEPKC
SSYVNHAVLVVGYGFEGNETDGNNYWLIKNSWGKRWGLRGYMKIAKDRNNHCAIASLAQYPTV
>gi|2499874|sp|Q28944.1|CATL1_PIG RecName: Full=Cathepsin L1; Contains: RecName: Full=Cathepsin L1 heavy chain; Contains: RecName: Full=Cathepsin L1 light chain; Flags: Precursor
MKPSLFLTALCLGIASAAPKLDQNLDADWYKWKATHGRLYGMNEEGWRRAVWEKNMKMIELHNQEYSQGK
HGFSMAMNAFGDMTNEEFRQVMNGFQNQKHKKGKVFHESLVLEVPKSVDWREKGYVTAVKNQGQCGSCWA
FSATGALEGQMFRKTGKLVSLSEQNLVDCSRPQGNQGCNGGLMDNAFQYVKDNGGLDTEESYPYLGRETN
SCTYKPECSAANDTGFVDIPQREKALMKAVATVGPISVAIDAGHSSFQFYKSGIYYDPDCSSKDLDHGVL
VVGYGFEGTDSNSSKFWIVKNSWGPEWGWNGYVKMAKDQNNHCGISTAASYPTV
>gi|4503155|ref|NP_001903.1| cathepsin L1 preproprotein [Homo sapiens]
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNMKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV
VGYGFESTESDNNKYWLVKNSWGEEWGMGGYVKMAKDRRNHCGIASAASYPTV
>gi|1705638|sp|P25975|CATL_BOVIN Cathepsin L precursor [Contains: Cathepsin L heavy chain; Cathepsin L light chain]
MNPSFFLTVLCLGVASAAPKLDPNLDAHWHQWKATHRRLYGMNEEEWRRAVWEKNKKIIDLHNQEYSEGK
HAFRMAMNAFGDMTNEEFRQVMNGFQNQKHKKGKLFHEPLLVDVPKSVDWTKKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGKLVSLSEQNLVDCSRAQGNQGCNGGLMDNAFQYIKDNGGLDSEESYPYLATDTN
SCNYKPECSAANDTGFVDIPQREKALMKAVATVGPISVAIDAGHTSFQFYKSGIYYDPDCSCKDLDHGVL
VVGYGFEGTDSNNNKFWIVKNSWGPEWGWNGYVKMAKDQNNHCGIATAASYPTV
>gi|15214962|gb|AAH12612.1| Cathepsin L1 [Homo sapiens]
MNPTLILAAFCLGIASATLTFDHSLEAQWTKWKAMHNRLYGMNEEGWRRAVWEKNVKMIELHNQEYREGK
HSFTMAMNAFGDMTSEEFRQVMNGFQNRKPRKGKVFQEPLFYEAPRSVDWREKGYVTPVKNQGQCGSCWA
FSATGALEGQMFRKTGRLISLSEQNLVDCSGPQGNEGCNGGLMDYAFQYVQDNGGLDSEESYPYEATEES
CKYNPKYSVANDTGFVDIPKQEKALMKAVATVGPISVAIDAGHESFLFYKEGIYFEPDCSSEDMDHGVLV
VGYGFESTESDNNKYWLVKNSWGEEWGMGGYVKMAKDRRNHCGIASAASYPTV
   o How can the result be interpreted? To what kind of proteins do the sequences belong?

9. With GENSCAN (http://genes.mit.edu/GENSCAN.html or http://mobyle.pasteur.fr/cgi-bin/portal.py?form=genscan), analyze the entry AC012088 (http://www.bioinformatics.polimi.it/masseroli/BBCMM/dispense/esercitazioni/E1_SequenceAlignment_Es9_AC012088_sequences.fasta.txt). Look at the predicted genes/exons and peptide sequence(s), and display the graphic “PDF image of the predicted gene(s)” that identifies the position of the predicted coding nucleotide sequences along the query sequences.
   o How can the result be interpreted?
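
Note on Practical 1: the grid method used there for the hand calculation can also be sketched in a few lines of Python, as a way to check how each grid cell is filled. This is only a minimal sketch under stated assumptions, not the implementation used by Needle or Water. The names global_align, local_score and identity_score are hypothetical. The identity_score function (+1 match, -1 mismatch) is a placeholder and must be replaced by a BLOSUM62 lookup to reproduce the exercise's scores. The local-alignment threshold of 28 is not modelled.

```python
from typing import Callable, Tuple

Score = Callable[[str, str], int]


def identity_score(a: str, b: str) -> int:
    """Placeholder substitution score (an assumption, NOT BLOSUM62)."""
    return 1 if a == b else -1


def global_align(s1: str, s2: str, score: Score = identity_score,
                 gap: int = -1) -> Tuple[int, str, str]:
    """Needleman-Wunsch grid with a linear gap penalty (gap is negative)."""
    n, m = len(s1), len(s2)
    # F[i][j] = best score for aligning s1[:i] with s2[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          F[i - 1][j] + gap,   # gap in s2
                          F[i][j - 1] + gap)   # gap in s1
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and F[i][j] == F[i - 1][j - 1] + score(s1[i - 1], s2[j - 1])):
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def local_score(s1: str, s2: str, score: Score = identity_score,
                gap: int = -2) -> int:
    """Smith-Waterman grid: cells are floored at 0; returns the best cell."""
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0,
                          H[i - 1][j - 1] + score(s1[i - 1], s2[j - 1]),
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    total, top, bottom = global_align("MTPARGSALS", "MTPVRRSLS", gap=-1)
    print(total)
    print(top)
    print(bottom)
    print(local_score("MTPARGSALS", "MTPVRRSLS", gap=-2))
```

Passing the substitution matrix as a function keeps the grid logic independent of the matrix choice, so the BLOSUM40, PAM120 and gap-penalty variations asked for in Practical 1 only require changing the score and gap arguments. When several alignments share the optimal score, the traceback returns only one of them, so the Web tools may report a different but equally scoring alignment.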