Determining the level of similarity

Molecular Biology-2015

HOMOLOGUES

Relationships between genes/proteins

Definitions:

Heterologues: Genes or proteins that possess different sequences and activities.

Homologues: Genes or proteins that share a threshold level of similarity as determined by alignment of matching bases or amino acids. Specifically, nucleotide sequences whose percent similarity is equal to or greater than 70% are termed homologous, whereas amino acid sequences whose percent similarity is equal to or greater than 25% are termed homologous.

Similarity is a quantitative term that defines the degree of sequence match between two compared sequences. For example, two aligned genes or segments of sequence that are homologous may have varying degrees of similarity based upon identical base matches in the alignment. In alignment A below, the sequences are identical and therefore exhibit 39 matches out of 39 aligned positions, or 100% similarity. In alignment B, the aligned sequences contain 28 matches out of 39 possible positions. The quantitative match, or degree of similarity, is then 28/39 or 72%. In both cases the sequences are homologous.

A   atgcctgaaggcctattgtttcccagtcgattggctgct...
    |||||||||||||||||||||||||||||||||||||||      39 of 39 matches
    atgcctgaaggcctattgtttcccagtcgattggctgct...

B   atgcctgaaggcctattgtttcccagtcgattggctgct...
    ||||||       |||||| |||||||| |||||| ||       28 of 39 matches
    atgcctcggcttatattgtatcccagtccattggcagcg...

Analogues: Genes or proteins that display the same activity but lack sufficient similarity to be homologs (less than 70% in the case of nucleotide sequences, or less than 25% in the case of protein sequences).

Paralogs: Homologous genes or proteins produced by gene duplication are termed paralogous. Given that gene duplication occurs within the same organism/species, paralogues are sequences that share a high degree of similarity within the same species. These may have similar or different activities.
Orthologs: After a speciation event, one homolog sorts with one species and the other copy with the other species. Subsequent divergence of the duplicated sequence is then associated with one or the other species. Consequently, orthologues are genes or proteins that share a high degree of similarity between different species.

Part I: Finding homologs from a nucleotide sequence

1. For this exercise you will be using the sequence represented by the mRNA accession number NM_000558.3. Obtain the corresponding definition, source organism, and FASTA sequence. To do so, go to the NCBI home page, enter the accession number in the search box, and choose nucleotide from the database options menu.
2. From the nucleotide record, obtain and save the FASTA protein sequence by clicking on the link "protein_id".
3. From the nucleotide record, click on the link "Run Blast" under the heading "analyze this sequence" on the right side of the page. This should bring you to the following page:
   [Figure: BLAST search page, with the options to choose marked by red boxes]
4. Choose the options indicated above by the red boxes. Then click on algorithm parameters, at the bottom of the page, to obtain more options.
5. Change the following parameters: set Max target sequences to 1000 and Expect threshold to 100. Click on BLAST to start the search.
6. Once you've obtained the BLAST results, as shown below, click on "Taxonomy reports" to display the different organisms in which sequence similarities were found.
7. A new page will appear as shown below. Find the entry for the same source organism, in this case Homo sapiens. Note the number of hits and click on it to obtain a list of those records.
8. This will bring you further down on the page as shown below:
9. Find the first entry whose accession number has the same second letter as the accession number you started with (M in this case, as in NM_, for mRNA) and whose description mentions "alpha 2".
Click on the accession number to obtain the record. Save the corresponding sequence in FASTA format.
10. Obtain and save the corresponding FASTA protein sequence by clicking on the link "protein_id", which can be found on the sequence record.
11. Repeat steps 6-10 to obtain the record and both the nucleotide and protein FASTA sequences for the gorilla (lowland gorilla). Note that this time the description may be the same, since we're looking at a different organism.

Determining the level of similarity

1. You should have saved three nucleotide and three protein sequences: two from humans and one from the gorilla. To determine the level of similarity at the nucleotide and at the protein level, we will use the program Clustal Omega. Copy and paste each of the protein sequences in FASTA format into the query box. Make sure to choose the option protein as shown below. You want to have something like this in the query box:

>gi|189202936
MQNDPWKWANEHFSTSDGRLIYQSSEPLPQDLTWYLEGLPPFLDISKEQSDTPNTVLWDLTYPIVAASGK
TSGHSSEKLGKPTNLSRWFAEVRLWGPWLAPRQGKDRFQPDKEAVLASFERHDGVHLVLLAVSGLNEVLT
TLNHDGDGRVVMNSNNDSDKDGLVRIVASVGHSLEDAVAASMYYVRKLIMAYEQSTGQINEEEKALTDDF
KPEWLENWYDGLTYCTWNGLGQKLTEEKIFDALESLRKNEINISNLIIDDNWQSLNTEGGDQFSNAWVEF
>gi|302408715
MARHSTIVVALALVGRAASRFDGLADTPPMGWHLLLSTSERVVSLGLRDLGYNTVVLDDCWQDPAGRDAK
GKVQPDLAKFPRGMKAISDALHAQNLKFGMYSSAGELTCARFAGSLDHERDDADSFAAWGVDFLKYDNCF
HMGRMGTPEISFNRFKAMSDALKASGRDIALNLCNWGEDYVHTWGASLAHAWRMSDDIYDSFTRPDDLCS
CASVADPFCVAPGTQCSVLFILNKVAPFADRAIPGGWNDLDMLEVGQGGMTDEEYKAHFALWAALKSPLM
LGNDLRIMDSAALSIINNPAIIALSQDPHGRAVYRVRRDVGPPRVPVADEYAAQEAHIWSGRLANGDQAV
>gi|185698558
MQNDPWKWANEHFSTSDGRLIYQSSEPLPQDLTWYLEGLPPFLDISKEQSDTPNTVLWDLTYPIVAASGK
TSGHSSEKLGKPTNLSRWFAEVRLWGPWLAPRQGKDRFQPDKEAVLASFERHDGVHLVLLAVSGLNEVLT
CASVADPFCVAPGTQCSVLFILNKVAPFADRAIPGGWNDLDMLEVGQGGMTDEEYKAHFALWAALKSPLM
LGNDLRIMDSAALSIINNPAIIALSQDPHGRAVYRVRRDVGPPRVPVADEYAAQEAHIWSGRLANGDQAV

2. Click Submit.
You will be brought to a new page that shows you the alignment as shown below. Interpreting the results displayed:
   "*" means that the residues or nucleotides in that column are identical.
   ":" means that conserved substitutions are observed.
   "." means that semi-conserved substitutions are observed.
3. To determine the percentage of residues that fall in each category, select the alignment and paste it into Word.
4. Once in Word, use the Replace function to count the symbols in each category. Simply select one of the symbols (for example ":") and ask Word to replace all occurrences with some other character (for example "&"). Record how many replacements were made. Columns with no symbol are non-conserved; obtain their number by subtracting the three symbol counts from the total number of aligned columns. For your assignment, indicate the percentage of identical, conserved, semi-conserved, and non-conserved substitutions.
5. Click on results summary and then on "percent identity matrix" to obtain the percentage of identity between the different protein sequences. See below.

#
#
#  Percent Identity  Matrix - created by Clustal2.1
#
#

     1: gi|302408715  100.00   18.58   64.66
     2: gi|189202936   18.58  100.00   62.75
     3: gi|185698558   64.66   62.75  100.00

6. These results are pairwise comparisons between the different sequences. Obtain from this file the percentage identity between each of the different pairs (human-1 to human-2, human-1 to gorilla, and human-2 to gorilla).
7. Repeat steps 1, 2, 5, and 6 with the nucleotide sequences. Make sure to change the sequence type to "DNA" in step 1.
8. Obtain the percentage identity between each of the different nucleotide sequence pairs (human-1 to human-2, human-1 to gorilla, and human-2 to gorilla).

Part II: Finding homologs from a protein sequence

1. For this exercise you will be using the sequence represented by the protein accession number NP_038491.2. Obtain the corresponding definition, source organism, and FASTA sequence.
To do so, go to the NCBI home page, enter the accession number in the search box, and choose protein from the database options menu.
2. From the record page, find the menu "Related information" further down on the page. Click on the link "Blink". A new page similar to the one shown below will be displayed.
3. Obtain the FASTA protein sequences for each of the following organisms:
   • Rattus norvegicus (Rat)
   • Cavia porcellus (Guinea pig)
   • Felis catus (Cat)
   • Homo sapiens (Human)
   To do so, click on each of the accession numbers to be redirected to the corresponding records.
4. As you did previously, determine the percentage identity between each pair using Clustal Omega.
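
The percent-similarity arithmetic from the Definitions section (matches divided by aligned positions) can be checked with a few lines of Python. This is only a minimal sketch: the function name count_matches is made up for illustration, and it assumes the two sequences are already aligned and of equal length, as in alignment B.

```python
def count_matches(seq_a, seq_b):
    """Count positions where two pre-aligned, equal-length sequences are identical."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


# The two sequences of alignment B in the Definitions section.
reference = "atgcctgaaggcctattgtttcccagtcgattggctgct"
variant = "atgcctcggcttatattgtatcccagtccattggcagcg"

matches = count_matches(reference, variant)
length = len(reference)
pct = 100.0 * matches / length

print(f"{matches} of {length} matches = {pct:.0f}%")  # 28 of 39 matches = 72%
```

With the handout's working threshold of 70% for nucleotide sequences, 72% places these two sequences just inside the homologous range.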
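
As an alternative to the find-and-replace trick in steps 3 and 4 of "Determining the level of similarity", the consensus symbols can be counted directly from the alignment text. The sketch below is an assumption-laden illustration, not part of the handout: consensus_counts is a hypothetical helper, the two-sequence alignment is a made-up toy (not output from a real Clustal Omega run), and the parser assumes the usual Clustal layout in which each block of sequence lines is followed by one consensus line.

```python
from collections import Counter


def consensus_counts(clustal_text):
    """Tally the consensus symbols ('*', ':', '.', and ' ' for none) per column."""
    counts = Counter()
    span = None  # (start column, width) of the block still waiting for its consensus line
    for line in clustal_text.splitlines():
        if line.startswith("CLUSTAL"):
            continue  # header line
        if line and not line[0].isspace():
            # Sequence line: "<name>   <residues>", optionally followed by a position count.
            name, residues = line.split()[:2]
            span = (line.index(residues, len(name)), len(residues))
        elif span is not None:
            # First non-sequence line after a block is its consensus line.
            # Pad with spaces in case trailing blanks were stripped when pasting.
            start, width = span
            counts.update(line[start:start + width].ljust(width))
            span = None
    if span is not None:
        # The file ended before a consensus line: every column is unmarked.
        counts[" "] += span[1]
    return counts


# Made-up toy alignment, for illustration only.
toy_alignment = """CLUSTAL O(1.2.4) multiple sequence alignment

seqA      MKTAYIAWQR
seqB      MRTSYLVGQR
          *:*:*:. **
"""

counts = consensus_counts(toy_alignment)
total = sum(counts.values())
labels = {"*": "identical", ":": "conserved", ".": "semi-conserved", " ": "non-conserved"}
for symbol, label in labels.items():
    print(f"{label}: {counts[symbol]} of {total} ({100 * counts[symbol] / total:.1f}%)")
```

Counting blank columns explicitly gives the non-conserved figure, which a find-and-replace on the symbols alone does not provide.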
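
The pairwise values requested in step 6 of "Determining the level of similarity" can also be read out of the percent identity matrix programmatically. The following sketch assumes the matrix has been saved as text in the layout Clustal2.1 normally produces (comment lines starting with "#", then one "index: name values..." row per sequence); parse_pim is a hypothetical helper name, and the numbers are the ones shown in step 5.

```python
from itertools import combinations

pim_text = """#
#
#  Percent Identity  Matrix - created by Clustal2.1
#
#

     1: gi|302408715  100.00   18.58   64.66
     2: gi|189202936   18.58  100.00   62.75
     3: gi|185698558   64.66   62.75  100.00
"""


def parse_pim(text):
    """Return (names, rows) from a percent identity matrix saved as text."""
    names, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank and comment lines
        # fields look like ["1:", "gi|302408715", "100.00", "18.58", "64.66"]
        names.append(fields[1])
        rows.append([float(value) for value in fields[2:]])
    return names, rows


names, rows = parse_pim(pim_text)

# One entry per unordered pair of sequences (upper triangle of the matrix).
pairs = {
    (names[i], names[j]): rows[i][j]
    for i, j in combinations(range(len(names)), 2)
}
for (first, second), identity in pairs.items():
    print(f"{first} vs {second}: {identity:.2f}%")
```

Because the matrix is symmetric, reading only the upper triangle is enough to list every pair once.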
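
The FASTA records that step 1 of Part I and Part II retrieve through the NCBI web pages can, as far as we know, also be requested through NCBI's E-utilities efetch service. This route is not part of the handout; the sketch below is offered only as an optional alternative. The helper names efetch_url and fetch_fasta are hypothetical, and the download function is defined but not called, since it needs network access and is subject to NCBI's usage limits.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db):
    """Build an efetch URL that asks for one record as plain-text FASTA."""
    query = urlencode({"db": db, "id": accession, "rettype": "fasta", "retmode": "text"})
    return f"{EFETCH}?{query}"


def fetch_fasta(accession, db):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(efetch_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Accession numbers used in Part I (mRNA) and Part II (protein).
nucleotide_url = efetch_url("NM_000558.3", "nuccore")
protein_url = efetch_url("NP_038491.2", "protein")
print(nucleotide_url)
print(protein_url)
```

Calling fetch_fasta("NM_000558.3", "nuccore") should return the same FASTA text you would save from the nucleotide record page.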
