Exercise 6: Searching and analyzing protein sequences using the Internet

A few of the exercises below involve the alignment of 2-4 different sequences. Useful sites for the alignment of 2 sequences are:
www-hto.usc.edu/software/seqaln/seqaln-query.html (select global alignment!)
genome.eerie.fr/bin/align-guess.cgi
Site for multiple sequence alignment:
www.medkem.gu.se/ln/molbio/gene/msf.html

1. Sequence analysis

From the previous exercise you learned that your DNA sequence has the following potential protein products. Each row of letters is one reading frame, with each amino acid placed under the first displayed base of its codon; * marks a stop codon. The reverse-strand peptides are written under the 3'->5' strand, so they read from N- to C-terminus right to left.

Forward strand:
5' CUGCCCUGUGCAGCUGUGGGUUGAUUCCACACUC 3'
   L  P  C  A  A  V  G  *  F  H  T
    C  P  V  Q  L  W  V  D  S  T  L
     A  L  C  S  C  G  L  I  P  H

Reverse strand:
3' GACGGGACACGUCGACACCCAACUAAGGUGUGAG 5'
   Q  G  T  C  S  H  T  S  E  V  S
    A  R  H  L  Q  P  N  I  G  C  E
     G  Q  A  A  T  P  Q  N  W  V

You will now try to identify proteins in protein sequence databases that match any of the open reading frames predicted from your DNA sequence. In the Basic BLAST Search, select 'blastp' as the program and 'nr' as the database. Try BLAST searches with the different peptide sequences. For one of them you should be able to identify a protein in the database with a nearly identical amino acid sequence. What is the name of the database protein that is highly homologous to your sequence?

**********************************************************************************

2. Searching databases for sequence homology

Sequence homology. Connect to NCBI-BLAST and subject the amino acid sequence above to a BLAST search. Go to www3.ncbi.nlm.nih.gov/BLAST/ and select "Basic BLAST search". Paste the sequence into the BLAST window. Select "blastp" as the program and "swissprot" as the database. The result should be something like:

Sequences producing High-scoring Segment Pairs:               Score  P(N)      N

sp|P04637|P53_HUMAN CELLULAR TUMOR ANTIGEN P53 (PHOSPHOP...    1582  6.1e-209  1
sp|P13481|P53_CERAE CELLULAR TUMOR ANTIGEN P53                 1536  1.3e-202  1
sp|P41685|P53_FELCA CELLULAR TUMOR ANTIGEN P53                  994  8.3e-177  2
. . .
Examine the result carefully, including the alignments of query and database sequences.

FASTA.

1. Use FASTA to identify homologs of the SRP54 protein in bacteria. Select the mouse SRP54 protein ("sw:sr54_mouse") as the query sequence.

2. The expression from ferritin and transferrin messengers is regulated by a protein, the IRE (iron responsive element) binding protein. When the amino acid sequence of this protein was obtained from a cDNA clone, it showed an unexpected similarity to a previously identified protein. Use STRINGSEARCH to locate the sequence of the IRE binding protein in the Swissprot database (hint: use, for instance, iron,responsive as the search string). Then use FASTA to compare it to the same database. Can you identify the protein related to the IRE binding protein?

3. Consider the sequence em:Z82206. In the annotation section there is information about an exon (<20814..21617). Use FASTA (www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nphblast?Jform=0) to compare this sequence to the human section of the EMBL database. What seems to be the protein encoded by the exon?

4. Use the three DNA sequences 11e03t3.seq, 12c02t3.seq and 7f06t3.seq (in the directory ~/gcg/4). They are bovine sequences in "Bluescript" vectors. Perform a FASTA search in the database of rodent DNA sequences (rod:*) to see if there are any homologous sequences. For at least one of the sequences you should be able to identify protein homologs.

**********************************************************************************

3. Comparing two sequences

GAP. The two sequences 1.seq and 2.seq are present in the directory ~/gcg/2. Compare them using the "Gap" program. Do they look similar? Use the option "Generate statistics from randomized alignments" (under "Options" in the GAP window) to answer the question.

BESTFIT. Compare the two sequences 1.seq and 2.seq using the "Bestfit" program. Do they look similar?
Use the option "Generate statistics from randomized alignments" (under "Options" in the BESTFIT window) to answer the question. Do you get different results from "Gap" and "Bestfit"? Why?

DOTPLOT. Compare the two sequences 1.seq and 2.seq by "dotplot" analysis. Run COMPARE with the output directed to DOTPLOT (DOTPLOT uses the output from COMPARE to make a 2D plot).

Identification of repeats with DOTPLOT. COMPARE may be used to identify repeats in a sequence. Analyze the sequence sw:prio_human (the human prion protein). In the Editor mode of SeqLab, make a copy of it to create two identical sequences. Apply these as input to COMPARE. How many repeats can you identify? Compare your answer to the information in the annotation section (use "Graphical features" to display the repeat regions in the Editor mode of SeqLab).

Alignment of genomic sequence with mRNA. Retrieve the nucleotide sequences em:V00594 (human mRNA for metallothionein) and em:J00271 (the corresponding genomic sequence) using the database browser. Compare these sequences by doing an alignment with GAP. Based on the alignment, how many exons are there in this gene? Compare your result to what is in the annotation section for J00271. In this case the result of GAP depends strongly on the gap penalty parameters you select. Try, for instance, Gap creation penalty = 10 and Gap extension penalty = 0 (under Options in the GAP window).

**********************************************************************************

4. Multiple sequence alignment.

PILEUP

2. Sequence analysis of Drosophila homeotic genes reveals a highly conserved region, the homeobox. In the protein antennapedia (antp) this sequence is:

Arg Arg Arg Ile Glu Ile Ala His Ala Leu Cys Leu Thr Glu Arg Gln Ile Lys Ile Trp Phe Gln Asn Arg Arg Met Lys

Enter this sequence with the SeqLab editor (one-letter symbols!) and use FASTA to identify homologous sequences in the database. Then select 6-7 of these sequences and use PILEUP to align them.
Look at "Graphical features" to see what's in the feature section of these entries. Can you find the homeobox motif? Use the result from PILEUP with PRETTY to display the alignment. Part of the result from PRETTY could be something like this:

            301                                                350
HMSC_DROME  YPWMKRVHLG TSTVNANGET KRQ.RTSYTR YQTLELEKEF HFNRYLTRRR
HMSC_APIME  .......... ..TVNANGEV KRQ.RTSYTR YQTLELEKEF HFNRYLTRRR
HMAA_DROME  MGSPFERVVC GDFNGPNGCP RRRGRQTYTR FQTLELEKEF HFNHYLTRRR
HMAA_APIME  .......... ...PGPNGCP RRRGRQTYTR FQTLELEKEF HYNHYLTRRR
HMAA_SCHGR  .......... .....PNGCP RRRGRQTYTR FQTLELEKEF HFNHYLTRRR
HMUX_DROME  .......... ....GTNG.L RRRGRQTYTR YQTLELEKEF HTNHYLTRRR
HXB6_HUMAN  PVYPWMQRMN SCNSSSFGPS GRRGRQTYTR YQTLELEKEF HYNRYLTRRR

            351                                                400
HMSC_DROME  RIEIAHALCL TERQIKIWFQ NRRMKWKKE. HKMASMNIVP YHMGPYGHPY
HMSC_APIME  RIEIAHALCL TERQIKIWFQ NRRMKWKKE. HKMASMNIVP YHMSPYGHPY
HMAA_DROME  RIEIAHALCL TERQIKIWFQ NRRMKLKKEL RAVKEINEQA RRDREEQEKM
HMAA_APIME  RIEIAHALCL TERQIKIWFQ NRRMKLKKEL RAVKEIN... ..........
HMAA_SCHGR  RIEIAHALCL TERQIKIWFQ NRRMKLKKEL RAVKEINEQA RREREEQDRL
HMUX_DROME  RIEMAHALCL TERQIKIWFQ NRRMKLKKEI QAIKELNEQE KQAQAQKAAA
HXB6_HUMAN  RIEIAHALCL TERQIKIWFQ NRRMKWKKES KLLSASQLSA EEEEEKQAE.

**********************************************************************************

5. Protein families

The following four amino acid sequences are derived from human proteins that all bind and hydrolyze GTP.

EF1-ALPHA
MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG
KGSFKYAWVL DKLKAERERG ITIDISLWKF ETSKYYVTII DAPGHRDFIK
NMITGTSQAD CAVLIVAAGV GEFEAGISKN GQTREHALLA YTLGVKQLIV
GVNKMDSTEP PYSQKRYEEI VKEVSTYIKK IGYNPDTVAF VPISGWNGDN
MLEPSANMPW FKGWKVTRKD GNASGTTLLE ALDCILPPTR PTDKPLRLPL
QDVYKIGGIG TVPVGRVETG VLKPGMVVTF APVNVTTEVK SVEMHHEALS

EF-2
MVNFTVDQIR AIMDKKANIR NMSVIAHVDH GKSTLTDSLV CKAGIIASAR
AGETRFTDTR KDEQERCITI KSTAISLFYE LSENDLNFIK QSKDGAGFLI
NLIDSPGHVD FSSEVTAALR VTDGALVVVD CVSGVCVQTE TVLRQAIAER
IKPVLMMNKM DRALLELQLE PEELYQTFQR IVENVNVIIS TYGEGESGPM
GNIMIDPVLG TVGFGSGLHG WAFTLKQFAE MYVAKFAAKG EGQLGPAERA
KKVEDMMKKL WGDRYFDPAN GKFSKSATSP EGKKLPRTFC QLILDPIFKV

SRP54
KELVKLVDPG VKAWTPTKGK QNVIMFVGLQ GSGKTTTCSK LAYYYQRKGW
KTCLICADTF RAGAFDQLKQ NATKARIPFY GSYTEMDPVI IASEGVEKFK
NENFEIIIVD TSGRHKQEDS LFEEMLQVAN AIQPDNIVYV MDASIGQACE
AQAKAFKDKV DVASVIVTKL DGHAKGGGAL SAVAATKSPI IFIGTGEHID
DFEPFKTQPF ISKLLGMGDI

SR-alpha
RRVDMLRDIM DAQRRQRPYV VTFCGVNGVG KSTNLAKISF WLLENGFSVL
IAACDTFRAG AVEQLRTHTR RLSALHPPEK HGGRTMVQLF EKGYGKDAAG
IAMEAIAFAR NQGFDVVLVD TAGRMQDNAP LMTALAKLIT VNTPDLVLFV
GEALVGNEAV DQLVKFNRAL ADHSMAQTPR LIDGIVLTKF DTIDDKVGAA
ISMTYITSKP IVFVGTGQTY CDLRSLNAKA VVAALMKA

Use multiple sequence alignment to compare them. Which two proteins are most closely related to each other? These two proteins form a separate class of GTP binding proteins. Can you identify in the alignment the consensus sequence GXXXXGK(S/T) ('X' is any amino acid), which is typical of GTP binding proteins? This sequence is part of a loop that binds the phosphate group of GTP.

**********************************************************************************

Profile search

PROFILEMAKE and PROFILESEARCH

There is evidence from sequence comparison that asparagine synthetase is evolutionarily related to aspartyl-tRNA synthetase (Hinchman, S.K. et al. 1992, J. Biol. Chem. 267: 144-149). The motif below is from five different aspartyl-tRNA synthetases.

Syd2human  PPHAGGGIGLERVTML
Syd2rat    PPHAGGGIGLERVTML
Sydcyeast  PPHAGGGIGLERVVMF
Sydmyeast  PPHAGFAIGFDRMCAM
Sydecoli   PPHAGLAFGLDRLTML

Enter these sequences in the SeqLab editor and use PROFILEMAKE to create a profile from them. Finally, search the E. coli proteins in Swissprot (sw:*_ecoli) with PROFILESEARCH, using the profile from PROFILEMAKE. Can you identify the relationship with asparagine synthetase?

**********************************************************************************

Pattern searches

1. The program FINDPATTERNS may be used to identify patterns in a nucleotide or amino acid sequence. Search Swissprot for the sequence "GDSGGP", which is typical of serine proteases. Click on "Patterns" in the FINDPATTERNS window. Select "Create new" and type the sequence above. Click on Apply change --> Close, then click on Run to execute the search.

2. Use FINDPATTERNS to identify zinc finger proteins. Select as pattern: Cx{2,4}Cx12Hx{3,5}H (which means: a cysteine residue followed by any two to four amino acids, a cysteine residue, any 12 amino acids, a histidine, any three to five amino acids and finally a histidine). In the result of FINDPATTERNS, can you find any proteins that are described as zinc finger proteins?

3. Identify cytochrome proteins that have exactly one methionine residue. Hints: First identify cytochromes using STRINGSEARCH. Then use the output from STRINGSEARCH with FINDPATTERNS. Select "M" (= methionine) as the pattern to search for. Click on "Options..."
in the FINDPATTERNS window and set both "Minimum..." and "Maximum number of occurrences" to 1.

4. Search for protein motifs in the sequence of human tissue plasminogen activator (sw:urot_human). Make use of the MOTIFS program, which looks for motifs as specified in PROSITE. Compare the result from MOTIFS with the information in the annotation section for the Swissprot entry.

**********************************************************************************

Miscellaneous programs

Protein secondary structure. HELICALWHEEL is used to display the arrangement of residues in an α-helical structure. Create the sequence "LRKQF KEMKK MMKQM TNMS" with the SeqLab editor and examine it with HELICALWHEEL.

PEPTIDESORT. PEPTIDESORT examines an amino acid sequence for proteolytic cleavage sites. Exercise: Cleave sw:gag_rsvp with trypsin. What fragments are obtained?

Secondary structure prediction of nucleic acids

1. Run "mfold" and "plotfold" on ecrna.seq and mmrna.seq (in directory ~/gcg/7). Include the sequences in the squiggle plot. Run "bestfit" and compare the sequences. Discuss the results.

2. Try to find prokaryotic transcription terminator structures with the TERMINATOR program. Try the sequences "em:ssdestn", "em:ectrpx", and "em:bsrggad".
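
For the PEPTIDESORT exercise above, it can help to see what a trypsin digest amounts to before running the program. The sketch below is only an illustration and is not PEPTIDESORT's own algorithm: it assumes the commonly used simplified rule that trypsin cleaves on the C-terminal side of K or R unless the next residue is P, and the function name tryptic_digest is hypothetical. The test peptide is the HELICALWHEEL sequence given earlier.

```python
def tryptic_digest(sequence):
    """Split a protein sequence into tryptic fragments.

    Simplified rule (an assumption, not PEPTIDESORT's exact logic):
    cleave after K or R, except when the next residue is P.
    """
    # Remove whitespace so sequences written in blocks of 5 or 10 are accepted.
    seq = "".join(sequence.split()).upper()
    fragments = []
    start = 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(seq[start:i + 1])
            start = i + 1
    # Keep whatever follows the last cleavage site.
    if start < len(seq):
        fragments.append(seq[start:])
    return fragments


peptide = "LRKQF KEMKK MMKQM TNMS"
for number, fragment in enumerate(tryptic_digest(peptide), start=1):
    print(number, fragment, len(fragment))
```

Under this rule the 19-residue test peptide gives seven fragments: LR, K, QFK, EMK, K, MMK and QMTNMS. A real digest of sw:gag_rsvp should still be done with PEPTIDESORT, whose cleavage rules may differ from this simplification.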