The sequences searched are all hemagglutinin protein sequences

The sequences searched are all hemagglutinin protein sequences of influenza
viruses. There are 100 sequences in total, all obtained from the
protein database of the National Center for Biotechnology Information (NCBI). The
sequences may be found in the file: “sequences.fasta”. They are distributed across the 16
hemagglutinin subtypes (H1 – H16) as follows:
H1N1: 7
H2N1: 7
H3N1: 6
H4N1: 3
H4N4: 1
H4N8: 3
H5N1: 3
H6N2: 7
H6N8: 2
H7N1: 10
H8N2: 1
H8N4: 4
H9N1: 9
H10N1: 2
H10N3: 2
H11N1: 1
H11N2: 6
H12N1: 2
H12N4: 2
H13N2: 3
H13N6: 6
H14N5: 2
H15N2: 1
H15N8: 3
H16N3: 7
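As a quick consistency check on the list above, the counts can be tallied per hemagglutinin subtype. The sketch below is only illustrative: the dictionary is copied by hand from the list, and the names `SUBTYPE_COUNTS` and `counts_by_hemagglutinin` are made up for this example and are not part of the original analysis.

```python
from collections import defaultdict

# Counts per HxNy subtype, copied by hand from the list above.
SUBTYPE_COUNTS = {
    "H1N1": 7, "H2N1": 7, "H3N1": 6, "H4N1": 3, "H4N4": 1, "H4N8": 3,
    "H5N1": 3, "H6N2": 7, "H6N8": 2, "H7N1": 10, "H8N2": 1, "H8N4": 4,
    "H9N1": 9, "H10N1": 2, "H10N3": 2, "H11N1": 1, "H11N2": 6,
    "H12N1": 2, "H12N4": 2, "H13N2": 3, "H13N6": 6, "H14N5": 2,
    "H15N2": 1, "H15N8": 3, "H16N3": 7,
}


def counts_by_hemagglutinin(subtype_counts):
    """Sum the counts over the neuraminidase part, keyed by H1..H16."""
    totals = defaultdict(int)
    for subtype, count in subtype_counts.items():
        h_part = subtype.split("N")[0]  # e.g. "H10N3" -> "H10"
        totals[h_part] += count
    return dict(totals)


by_h = counts_by_hemagglutinin(SUBTYPE_COUNTS)
total = sum(SUBTYPE_COUNTS.values())
print(total, len(by_h))
```

The total confirms the 100 sequences stated above, and every one of the 16 hemagglutinin subtypes is represented at least once.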
The sequence analysis tool “ClustalW” of the European Bioinformatics Institute (EBI)
is used to find the pairwise distances between all pairs of sequences.
The pairwise distance is calculated as follows:
Distance = 1 – {(No. of identities in the best alignment) / (total no. of residues
compared)}
Gaps are not considered in the total no. of comparisons.
The pairwise distances may be found in the file: “pairwise”
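The distance formula above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same length and that gaps are written as “-”. The function name `pairwise_distance` and the example sequences are hypothetical; this is not ClustalW's own code.

```python
def pairwise_distance(aligned_a, aligned_b, gap="-"):
    """Return 1 - identities / residues compared, skipping gapped positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = 0
    compared = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gaps are not counted in the total number of comparisons
        compared += 1
        if res_a == res_b:
            identities += 1
    if compared == 0:
        raise ValueError("no residue pairs to compare")
    return 1.0 - identities / compared


# Made-up aligned fragments: 7 positions compared, 6 identical.
d = pairwise_distance("MKT-IIAL", "MKTAIVAL")
print(round(d, 4))
```

In this example the gapped column is skipped, so the distance is 1 – 6/7.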
The same tool, ClustalW, is used to generate the Multiple Sequence Alignment
of the sequences. This may be found in the file: “multiple”.
ClustalW further generates the alignments in the PHYLIP format (.ph). The alignments in
the PHYLIP format may be found in the file: “align_phylip”.
To find the phylogenies, both character-based and distance-based, the PHYLogeny
Inference Package (PHYLIP) is used.
The tool “Protpars” in PHYLIP accepts the aligned sequences in the PHYLIP format and
generates the character-based phylogenetic tree. In character-based phylogeny, the
substitutions that the nucleotides undergo at different stages in the transformation from
one sequence to another are registered. Protpars counts only those nucleotide
substitutions that bring about a change in the amino acid. The other substitutions are
considered too minor to be registered.
The phylogenetic trees generated by Protpars may be found in the files: “char1” and
“char2”.
The distance-based phylogeny is also generated using the PHYLIP package. In this case,
the tools “Protdist” and “Fitch” are used one after the other to generate the phylogenetic
trees.
Protdist accepts the aligned sequences in the PHYLIP format and returns the distance
matrix which is a measure of the evolutionary distances between the various sequences.
Evolutionary distance in this case means the fraction of the amino-acids that have
undergone a change. This distance matrix may be found in the file: “dist1”.
After this, the tool Fitch accepts the distance matrix and generates the distance trees,
in which the lengths of the branches represent the evolutionary distances
between the different stages.
The distance-based phylogenetic trees may be found in the files: “dist_tree1” and
“dist_tree2”.
An evaluation of the character-based phylogeny obtained from Protpars is as follows:
The file “char_with_table” contains the character-based phylogenetic trees obtained from
Protpars, and each tree is followed by a table. This table describes the different stages of
evolution that the sequences go through in the transition from one sequence to another. The
table starts with the first sequence in the set of aligned sequences, gi| 1912345. This
sequence is denoted by “1”. The branching of the tree describes each stage of the
evolution, and the different stages are labeled by numbers (1, 2, 3, etc.).
The table describes the amino-acid substitutions that occur at each stage. An
amino-acid substitution is indicated by the corresponding letter. A “.” represents ‘no
change’; an “X” represents a substitution by an unknown amino acid.
A “?” is used to indicate either an amino-acid substitution or a deletion.
Each table is broken into smaller blocks because of lack of space (i.e., the first block
describes the first 40 positions of the sequences, and so on). Thus, using this table and
observing the tree, the different stages of evolution that the sequences go through while
converting from one form to the other may be traced.
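To make the symbol conventions concrete, the following sketch decodes one row of such a table into a list of changes. The row layout is an assumption made for illustration: a plain string of symbols, possibly broken up by spaces. The function name `decode_row` and the sample row are hypothetical and are not taken from the actual “char_with_table” file.

```python
def decode_row(symbols, start=1):
    """List (position, meaning) for every symbol that is not '.' (no change).

    symbols: the characters of one table row; whitespace is ignored.
    start:   alignment position of the first symbol in this row.
    """
    changes = []
    position = start
    for ch in symbols:
        if ch.isspace():
            continue  # spacing between groups carries no information
        if ch == "X":
            changes.append((position, "substitution, amino acid unknown"))
        elif ch == "?":
            changes.append((position, "substitution or deletion"))
        elif ch != ".":
            changes.append((position, "substitution to " + ch))
        position += 1
    return changes


# Made-up row fragment covering positions 1 to 9.
row_changes = decode_row("..K.. X?..")
for pos, meaning in row_changes:
    print(pos, meaning)
```

For a later part of the same table, `start` would be set to the position of that part's first column (for example 41 for the second part if each part covers 40 positions).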
Observing the character-based phylogenetic tree, one expected conclusion was that
almost all sequences that were of the same sub-type and that were derived from the same
source were close to each other in terms of evolution. However, a few interesting
observations are as follows:
(i) gi| 14289397 (H7N1), found in turkeys, was very close to gi| 70608897 (H6N2),
found in mallards.
(ii) gi| 116235395 (H11N2), found in swans, was close to gi| 68137154 (also H11N2),
which was found in ducks.
(iii) gi| 11936550 (H11N2), found in green wing teals, was close to gi| 82654072 (also
H11N2), which was found in mallards.
(iv) gi| 118595866 and gi| 125716848, both H13N6, were close to each other although
the first was obtained from ducks and the second from gulls.
(v) gi| 11027796 (H13N6), from gulls, was found to be close to gi| 32330958 (H1N1),
found in Taiwan.
An evaluation of the distance-based phylogenetic tree also produced some interesting
results.
Looking at the tree, five “clusters” of sequences could clearly be made out. This gives a
rough idea of a pattern of similarity that exists among the various subtypes.
Each of the clusters observed consists of two or more sub-clusters. The clusters observed
are as follows:
(i) The sub-cluster of H2N1 sequences forms a cluster with the sub-cluster
comprising H5N1 sequences.
(ii) The sub-cluster of H6N2 sequences combines with the sub-cluster formed by
H13N2 sequences and the sub-cluster formed by H9N1 sequences, and the three
sub-clusters together form a large cluster.
(iii) A sub-cluster comprising H8N4+H8N2 sequences combines with a sub-cluster
formed by H11N1 sequences, a sub-cluster formed by H16N3 sequences and a
sub-cluster formed by H13N6 sequences to form a large cluster.
(iv) The sub-clusters of H3N1 sequences and H4N8+H4N1 sequences form a
cluster.
(v) H10N3+H10N1 sequences form a sub-cluster that joins with the sub-cluster
formed by H7N1 sequences to form a cluster.