Determining the level of similarity

Molecular Biology-2015
HOMOLOGUES
Relationships between genes/proteins
Definitions:
Heterologues: Genes or proteins that possess different sequences and activities.
Homologues: Genes or proteins that share a threshold level of similarity as determined by alignment
of matching bases or amino acids. Specifically, nucleotide sequences whose percent similarity is
equal to or greater than 70% are termed homologous. In contrast, amino acid sequences whose
percent similarity is equal to or greater than 25% are said to be homologous. Similarity is a
quantitative term that defines the degree of sequence match between two compared sequences. For
example, two aligned genes or segments of sequence that are homologous may have varying degrees
of similarity based upon identical base matches in the alignment. In the first sequence alignment in
the following figure, the sequences are obviously identical and therefore exhibit 39 matches out of 39
positions aligned, or 100% similarity. In the second alignment the aligned sequences contain 28
matches out of 39 possible. The quantitative match or degree of similarity is then 28/39 or 72%. In
both cases the sequences are homologous.
A
atgcctgaaggcctattgtttcccagtcgattggctgct...
|||||||||||||||||||||||||||||||||||||||
atgcctgaaggcctattgtttcccagtcgattggctgct...
39 of 39 matches

B
atgcctgaaggcctattgtttcccagtcgattggctgct...
||||||       |||||| |||||||| |||||| ||
atgcctcggcttatattgtatcccagtccattggcagcg...
28 of 39 matches
Analogues: Genes or proteins that display the same activity but lack sufficient similarity to be
homologues (less than 70% in the case of nucleotide sequences or less than 25% in the case of
protein sequences).
Paralogues: Homologous genes or proteins produced by gene duplication are termed paralogous.
Given that gene duplication occurs within the same organism/species, paralogues are sequences
that share a high degree of similarity within the same species. These may have similar or different
activities.
Orthologues: After a speciation event, one copy of a homologous sequence is inherited by one species
and the other copy by the other species. Each copy then diverges independently within its own
species. Consequently, orthologues represent genes or proteins which share a high degree of
similarity between different species.
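The match-counting arithmetic used in the definition of homologues can be sketched in a few lines of Python. This is only an illustration, not part of the exercise: the function names are made up for this sketch, the sequences are the two shown in alignment B, and the 70% and 25% cut-offs are the thresholds quoted in the definitions above. The sketch assumes the two sequences are already aligned and have the same length; it does not handle gaps.

```python
# Thresholds quoted in the definitions (percent of matching positions).
THRESHOLDS = {"nucleotide": 70.0, "protein": 25.0}


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that hold the same character."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def is_homologous(seq_a: str, seq_b: str, kind: str = "nucleotide") -> bool:
    """Apply the course threshold for the given sequence type."""
    return percent_identity(seq_a, seq_b) >= THRESHOLDS[kind]


# The two sequences of alignment B (28 matches out of 39 positions).
reference = "atgcctgaaggcctattgtttcccagtcgattggctgct"
variant = "atgcctcggcttatattgtatcccagtccattggcagcg"

print(round(percent_identity(reference, variant)))  # 28/39 rounds to 72
print(is_homologous(reference, variant))            # True: 72% is above 70%
```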
Part I: Finding homologs from a nucleotide sequence
1. For this exercise you will be using the sequence represented by the mRNA accession number
NM_000558.3. Obtain the corresponding definition, source organism, and FASTA sequence. To
do so, go to the NCBI home page, enter the accession number in the search box, and choose
nucleotide from the database options menu.
2. From the nucleotide record, obtain and save the FASTA protein sequence by clicking on the link
protein_id.
3. From the nucleotide record click on the link "Run Blast" under the heading "analyze this
sequence" on the right side of the page. This should bring you to the following page:
4. Choose the options indicated above by the red boxes. Then click on algorithm parameters, at the
bottom of the page, to obtain more options.
5. Change the following parameters: Set Max target sequences to 1000 and Expect threshold to 100.
Click on Blast to start the search.
6. Once you've obtained the Blast results, as shown below, click on "Taxonomy reports" to display
the different organisms in which sequence similarities were found.
7. A new page will appear as shown below. Find the entry for the same source organism, in this case
Homo sapiens. Note the number of hits and click on it to obtain a list of those records.
8. This will bring you further down on the page as shown below:
9. Find the first entry whose accession number has the same second letter as the accession number
you started with (M in this case, as in NM_, for mRNA) and whose description mentions
"alpha 2". Click on the accession number to obtain the record. Save the corresponding
sequence in FASTA format.
10. Obtain and save the corresponding FASTA protein sequence by clicking on the link "protein_id",
which can be found on the sequence record.
11. Repeat steps 6-10 to obtain the record and both the nucleotide and protein FASTA sequences for
the gorilla (lowland gorilla). Note that this time the description may be the same since we're
looking at a different organism.
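The exercise retrieves each FASTA sequence through the NCBI web pages. As an optional alternative, the same records can be requested through NCBI's E-utilities "efetch" service. The sketch below is not part of the handout: the helper names fasta_url and fetch_fasta are invented for this illustration, and the database names "nuccore" (nucleotide) and "protein" are the E-utilities names assumed here. Only the URL is built when the script runs; calling fetch_fasta requires an internet connection.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession: str, db: str) -> str:
    """Build the efetch URL that returns one record as plain-text FASTA."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def fetch_fasta(accession: str, db: str) -> str:
    """Download the FASTA text (needs network access; not called below)."""
    with urlopen(fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


# The mRNA accession number used in Part I.
print(fasta_url("NM_000558.3", "nuccore"))
```

To save a sequence to a file, the text returned by fetch_fasta("NM_000558.3", "nuccore") can be written out with an ordinary file write.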
Determining the level of similarity
1. You should have saved three nucleotide and three protein sequences: two from humans and one
from the gorilla. To determine the level of similarity at the nucleotide and at the protein level we
will use the program Clustal Omega. Copy and paste each of the protein sequences in FASTA
format into the query box. Make sure to choose the option "Protein" as shown below.
You want to have something like this in the query box:
>gi|189202936
MQNDPWKWANEHFSTSDGRLIYQSSEPLPQDLTWYLEGLPPFLDISKEQSDTPNTVLWDLTYPIVAASGK
TSGHSSEKLGKPTNLSRWFAEVRLWGPWLAPRQGKDRFQPDKEAVLASFERHDGVHLVLLAVSGLNEVLT
TLNHDGDGRVVMNSNNDSDKDGLVRIVASVGHSLEDAVAASMYYVRKLIMAYEQSTGQINEEEKALTDDF
KPEWLENWYDGLTYCTWNGLGQKLTEEKIFDALESLRKNEINISNLIIDDNWQSLNTEGGDQFSNAWVEF
>gi|302408715
MARHSTIVVALALVGRAASRFDGLADTPPMGWHLLLSTSERVVSLGLRDLGYNTVVLDDCWQDPAGRDAK
GKVQPDLAKFPRGMKAISDALHAQNLKFGMYSSAGELTCARFAGSLDHERDDADSFAAWGVDFLKYDNCF
HMGRMGTPEISFNRFKAMSDALKASGRDIALNLCNWGEDYVHTWGASLAHAWRMSDDIYDSFTRPDDLCS
CASVADPFCVAPGTQCSVLFILNKVAPFADRAIPGGWNDLDMLEVGQGGMTDEEYKAHFALWAALKSPLM
LGNDLRIMDSAALSIINNPAIIALSQDPHGRAVYRVRRDVGPPRVPVADEYAAQEAHIWSGRLANGDQAV
>gi|185698558
MQNDPWKWANEHFSTSDGRLIYQSSEPLPQDLTWYLEGLPPFLDISKEQSDTPNTVLWDLTYPIVAASGK
TSGHSSEKLGKPTNLSRWFAEVRLWGPWLAPRQGKDRFQPDKEAVLASFERHDGVHLVLLAVSGLNEVLT
CASVADPFCVAPGTQCSVLFILNKVAPFADRAIPGGWNDLDMLEVGQGGMTDEEYKAHFALWAALKSPLM
LGNDLRIMDSAALSIINNPAIIALSQDPHGRAVYRVRRDVGPPRVPVADEYAAQEAHIWSGRLANGDQAV
2. Click Submit. You will be brought to a new page that shows you the alignment as shown below.
Interpreting the results displayed:
"*" means that the residues or nucleotides in that column are identical.
":" means that conserved substitutions are observed.
"." means that semi-conserved substitutions are observed.
3. To determine the percentage of residues that fall in each category, select the alignment and paste
it in “Word”.
4. Once in “Word”, use the “Replace” function to count the positions in each category. Select one
of the symbols (for example “:”) and ask Word to replace all occurrences with some other character
(for example “&”). Record how many replacements were made, then repeat for the other symbols.
Columns with no symbol are non-conserved, so their number is the alignment length minus the three
counts. For your assignment, indicate the percentage of identical, conserved, semi-conserved and
non-conserved substitutions.
5. Click on results summary and then on "percent identity matrix" to obtain the percentage of
identity between the different protein sequences. See below.
#
#
#  Percent Identity  Matrix - created by Clustal2.1
#
#

     1: gi|302408715  100.00   18.58   64.66
     2: gi|189202936   18.58  100.00   62.75
     3: gi|185698558   64.66   62.75  100.00
6. These results are pairwise comparisons between the different sequences. Obtain from this file the
percentage identity between each of the different pairs (human-1 to human-2, human-1 to gorilla,
and human-2 to gorilla).
7. Repeat steps 1, 2, 5, and 6 with the nucleotide sequences. Make sure to change the sequence
type to "DNA" in step 1.
8. Obtain the percentage identity between each of the different nucleotide sequence pairs (human-1
to human-2, human-1 to gorilla, and human-2 to gorilla).
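The symbol counting described in steps 3 and 4 can also be done with a short script instead of a word processor. The following is a rough sketch only: the function names are invented, and the toy alignment is made up for illustration (its consensus line was written by hand). It assumes the alignment is pasted in Clustal format, where each block of sequence lines is followed by one line of consensus symbols, and it counts only the characters that sit under the sequence columns.

```python
from collections import Counter

LABELS = {"*": "identical", ":": "conserved", ".": "semi-conserved", " ": "non-conserved"}


def consensus_from_clustal(text: str) -> str:
    """Join the consensus symbols of every block, one character per column."""
    pieces = []
    start = width = None  # column span of the sequence chunk in the current block
    for line in text.splitlines():
        if line.startswith("CLUSTAL"):
            continue  # header line
        if line and not line[0].isspace():
            # Sequence line: a name, then the aligned chunk (a count may follow).
            name, chunk = line.split()[:2]
            start = line.index(chunk, len(name))
            width = len(chunk)
        elif start is not None:
            # First non-sequence line after a block is its consensus line;
            # pad it in case trailing spaces were stripped when pasting.
            pieces.append(line[start:start + width].ljust(width))
            start = None
    if start is not None:
        pieces.append(" " * width)  # last block had no consensus line at all
    return "".join(pieces)


def consensus_percentages(consensus: str) -> dict:
    """Percentage of alignment columns in each category."""
    if not consensus:
        raise ValueError("no alignment columns found")
    counts = Counter(consensus)
    return {label: 100.0 * counts[symbol] / len(consensus)
            for symbol, label in LABELS.items()}


# Toy two-sequence alignment, invented for this sketch.
toy = "\n".join([
    "CLUSTAL O(1.2.4) multiple sequence alignment",
    "",
    "seq1      MKTAYIAKWQ",
    "seq2      MKSAYLGKGQ",
    "          **:**:.* *",
])

print(consensus_percentages(consensus_from_clustal(toy)))
```

Counting only the columns under the sequences avoids picking up the dots and colons that appear elsewhere in the pasted text, such as in the header line.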
Part II: Finding homologs from a protein sequence
1. For this exercise you will be using the sequence represented by the protein accession number
NP_038491.2. Obtain the corresponding definition, source organism, and FASTA sequence. To
do so, go to the NCBI home page, enter the accession number in the search box, and choose
protein from the database options menu.
2. From the record page, find the menu "Related information" further down on the page. Click on
the link "BLink".
A new page similar to the one shown below will be displayed.
3. Obtain the FASTA protein sequences for each of the following organisms:
• Rattus norvegicus (Rat)
• Cavia porcellus (Guinea pig)
• Felis catus (Cat)
• Homo sapiens (Human)
To do so, click on each of the accession numbers to be redirected to the corresponding records.
4. As you did previously, use Clustal Omega to determine the percentage identity between each pair.
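With four sequences there are six pairs to read off the percent identity matrix, so it can help to list them automatically. The sketch below is an illustration only: the function name is invented, and the example text is the three-sequence matrix shown in Part I, assumed to be laid out as a row number, a sequence name and one value per sequence, with header lines starting with "#".

```python
from itertools import combinations


def parse_identity_matrix(text: str) -> dict:
    """Return {(name_a, name_b): percent identity} for every distinct pair."""
    names, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the "#" header lines
        # fields[0] is the row number ("1:"), fields[1] the sequence name.
        names.append(fields[1])
        rows.append([float(value) for value in fields[2:]])
    return {(names[i], names[j]): rows[i][j]
            for i, j in combinations(range(len(names)), 2)}


# The example matrix from Part I.
matrix_text = """#
#
#  Percent Identity  Matrix - created by Clustal2.1
#
#

     1: gi|302408715  100.00   18.58   64.66
     2: gi|189202936   18.58  100.00   62.75
     3: gi|185698558   64.66   62.75  100.00
"""

for pair, identity in parse_identity_matrix(matrix_text).items():
    print(pair[0], "vs", pair[1], identity)
```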