
Characteristic Restriction Endonuclease Cut Order for
Classification and Analysis of DNA Sequences
Rajib SenGupta
College of Information Science and
Technology, University of Nebraska
at Omaha
Omaha, NE 68102-0116, USA
Problem Statement
The motivation for this project is the old
holy grail of bioinformatics:
Sequence Identification & Classification
Current Approaches
1. Computational approach – Pairwise
local and Multiple Sequence
Alignment
2. Laboratory Method – RFLP, Southern
Blotting
Existing Methods - Limitations
Pairwise or Multiple Alignment
1. Alignment is a ‘fine-grained’ approach
2. Computationally intensive: optimal multiple sequence alignment is NP-hard, so it does not scale to large datasets
3. Introduces gaps: gaps are interpreted as evolutionary
events in molecular phylogeny, and misaligned sequences carry
no useful biological information
4. Heuristics such as BLAST are therefore employed
Laboratory Methods (RFLP)
1. Only feasible for few sequences
2. Human and procedural error
3. In-silico RFLP methods (TRFLP program) require alignment
as the second step for sequence identification
Ideation
Utilize the ‘coarse-grain features’ of RFLP/restriction
enzymes computationally (in silico), as opposed to the ‘fine-grain features’ of alignment.
Restriction Endonuclease

Proteins that recognize a particular
sequence of nucleotides (called the
restriction site, generally 4 to 8 bases
long) and cut the double-stranded DNA
molecule at the restriction site
RFLP





Restriction Fragment Length Polymorphism
(RFLP)
Widely used laboratory method in molecular
identification and Phylogenetic studies.
This approach requires the sequences to be cut
into several fragments with the help of restriction
endonucleases.
Variation in the positions of the restriction sites along
the DNA among the sequences being analyzed
leads to digestion products of varying
lengths.
Following high-resolution gel electrophoresis of
the digestion products, the fragment patterns are
visually compared to determine the similarity
between the sequences.
RFLP
Proposed Concept



New Idea
Uses Enzyme Cut Order (ECO) information
from DNA for evaluation
Definition:
– The ECO of a DNA sequence (S) for a particular set of
restriction enzymes {Ez} is a string (array) of
enzyme names (represented as numeric ids) in the
order in which each enzyme (ez ∈ Ez) cuts the sequence.
– The ECO may also include the position, counted from the
start of the sequence, at which each cut occurs.
– Thus, the ECO is a string (array) of tuples consisting
of enzyme id and cut position.
– Example:
GenBank Classification
ORGANISM: Lirula macrospora
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
Leotiomycetes; Rhytismatales; Rhytismataceae; Lirula
ORGANISM: Nectria haematococca
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
Sordariomycetes; Hypocreomycetidae; Hypocreales; Nectriaceae;
Nectria; Nectria haematococca complex.
ORGANISM: Nectria mauritiicola
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
Sordariomycetes; Hypocreomycetidae; Hypocreales; Nectriaceae;
Nectria.
ORGANISM: Oligoporus placentus
CLASSIFICATION: Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
Agaricomycetes; Polyporales; Oligoporus
Concept (contd.)
Closely related organisms have
similar Enzyme Cut Orders
Table 1: The ECOs for ‘ITS’ sequences from closely and distantly related
fungi. The closely related Nectria species (Nectria haematococca and
Nectria mauritiicola) show a high level of ECO similarity.
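The ECO definition can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the function name `enzyme_cut_order`, the toy sequence, the use of enzyme names instead of numeric ids, and reporting the start of the recognition site as the cut position are all assumptions. The recognition sites for TaqI (TCGA), HaeIII (GGCC) and AluI (AGCT) are the standard ones.

```python
import re

def enzyme_cut_order(sequence, enzymes):
    """Return the ECO as a list of (enzyme_name, position) tuples, ordered by position.

    `enzymes` maps an enzyme name to its recognition site (plain A/C/G/T here).
    Position is the 0-based start of the recognition site; the exact cut
    offset inside the site is ignored in this sketch (an assumption).
    """
    hits = []
    for name, site in enzymes.items():
        # The lookahead also finds overlapping occurrences of the site.
        for match in re.finditer("(?=" + re.escape(site) + ")", sequence):
            hits.append((name, match.start()))
    # Order by position; ties are broken by enzyme name to stay deterministic.
    hits.sort(key=lambda hit: (hit[1], hit[0]))
    return hits

enzymes = {"TaqI": "TCGA", "HaeIII": "GGCC", "AluI": "AGCT"}
eco = enzyme_cut_order("AATCGATTGGCCAAAGCTTTCGA", enzymes)
print(eco)
# [('TaqI', 2), ('HaeIII', 8), ('AluI', 14), ('TaqI', 19)]
```

Dropping the positions from the tuples leaves the plain enzyme order (TaqI, HaeIII, AluI, TaqI), which is the string that is compared between sequences.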
Quantifying ECO

Enzyme Cut Order (ECO) Similarity Score
– The similarity score between two ECOs depends on
  – the number of enzymes they have in common, and
  – the order in which these enzymes cut the sequences
1. The similarity score is higher when a
larger number of common enzymes appear in
the same order in the two Enzyme Cut Orders.
2. This is captured by the Longest Common
Subsequence (LCS) of two strings, where the
strings are the ECOs.
3. The length of the LCS
between two ECOs (E1 and E2) of the two
corresponding sequences (S1 and S2) is
taken as the Enzyme Cut Order Similarity
Score between E1 and E2.
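The similarity score can be sketched with the textbook dynamic-programming LCS. The function name `lcs_length` and the two toy ECOs are assumptions made for illustration; positions are left out so that only the enzyme order is compared.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Standard dynamic programming, O(len(a) * len(b)) time, keeping only
    the previous row of the table.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

eco_1 = ["TaqI", "HaeIII", "AluI", "TaqI", "RsaI"]
eco_2 = ["TaqI", "AluI", "HinfI", "TaqI"]
score = lcs_length(eco_1, eco_2)
print(score)  # 3, from the common subsequence TaqI, AluI, TaqI
```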
Hypothesis
Organisms closer to each other in the
phylogenetic tree have highly similar
Enzyme Cut Orders. Similarity is
measured by the Enzyme Cut Order
Similarity Score, which is the length of the
LCS of the corresponding Enzyme
Cut Orders of the DNA sequences of
the organisms.
Preliminary Result Summary



Enzyme Cut Order is a distinguishing
characteristic of DNA sequences
The similarity between two sequences
can be defined by Enzyme Cut Order
Similarity Score
The ECO similarity score can be measured
as the length of the LCS of the
corresponding Enzyme Cut Orders of
the DNA sequences of the organisms
Overall Method Diagram
[Figure: flow diagram. Components shown: Sequence DB; Restriction Enzyme DB; Enzyme Cut Order; Array of Enzyme Cut Orders; Similarity Score Algorithm; Similarity Matrix; Clustering Algorithm; Taxon DB; Cluster DB; Analysis of Clusters; Report; Graph; Genetic Algorithms; Optimal Enzyme Set]
Step 1 – Sequence Data Collection and Curation




Create a local database of GenBank sequences obtained in FASTA
or XML format
Reference these sequences against the taxon database
Create a curated taxonomy database for these sequences using
user-defined taxonomical rules
Fungi ITS sequences from GenBank
– “Organism description” of the GenBank entries (or
OrgName_Lineage in XML format)
– Classification categories included Kingdom, Division, Class,
Order, Family, Genus, Species
– Use a simple suffix rule and the position in the lineage to decide the category
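A suffix-and-position rule of this kind might look like the sketch below. The actual user-defined rules are not given on the slide, so everything here is an assumption: the suffix table uses the standard rank endings of fungal nomenclature, the second lineage entry is taken as the kingdom, and the entry right after the last suffix-matched name is taken as the genus. The function name `classify_lineage` is hypothetical.

```python
# Standard rank endings in fungal nomenclature (the slide's own rules are not given).
SUFFIX_TO_RANK = [
    ("mycotina", "Subdivision"),
    ("mycetidae", "Subclass"),
    ("mycota", "Division"),
    ("mycetes", "Class"),
    ("aceae", "Family"),
    ("ales", "Order"),
]

def classify_lineage(lineage):
    """Map a semicolon-separated GenBank lineage to {rank: name}."""
    names = [n.strip().rstrip(".") for n in lineage.split(";") if n.strip()]
    ranks = {}
    last_matched = -1
    for i, name in enumerate(names):
        for suffix, rank in SUFFIX_TO_RANK:
            if name.lower().endswith(suffix):
                ranks[rank] = name
                last_matched = i
                break
    # Positional rules (assumptions): second entry is the kingdom, and the
    # entry following the last suffix-matched name is the genus.
    if len(names) > 1:
        ranks["Kingdom"] = names[1]
    if 0 <= last_matched < len(names) - 1:
        ranks["Genus"] = names[last_matched + 1]
    return ranks

lineage = ("Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; "
           "Sordariomycetes; Hypocreomycetidae; Hypocreales; Nectriaceae; "
           "Nectria; Nectria haematococca complex.")
ranks = classify_lineage(lineage)
print(ranks["Division"], ranks["Order"], ranks["Family"], ranks["Genus"])
```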
Step 2 – Enzyme Data Collection



Create a database of restriction enzymes obtained from
REBASE
Add further relevant information about these restriction
enzymes (isoschizomers, commercial availability, reverse
cut site) for later use
Recognition sequences containing symbols other
than A, T, G and C were interpreted according to the IUB ambiguity
codes (Eur. J. Biochem. 150: 1-5, 1985).
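One way to interpret ambiguity codes in a recognition sequence is to expand each symbol into a regular-expression character class. The sketch below assumes this approach; the helper name `site_to_regex` and the toy sequence are illustrative, while the code table itself is the standard IUB/IUPAC nucleotide set and GANTC is the HinfI recognition site.

```python
import re

# IUB/IUPAC nucleotide ambiguity codes expanded to regex character classes.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def site_to_regex(site):
    """Compile a recognition site with ambiguity codes into a regex."""
    return re.compile("".join(IUB_CODES[base] for base in site.upper()))

hinfi = site_to_regex("GANTC")  # HinfI: N matches any base
positions = [m.start() for m in hinfi.finditer("TTGACTCAAGATTCGG")]
print(positions)  # [2, 9]
```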
Step 3 – Enzyme Cut Order DB Build



Obtain the enzyme cut order using a user-defined set of
restriction enzymes {Ez}.
The enzyme cut order is obtained for every test sequence
and every enzyme in {Ez}
Evaluate the effect of the size and type of the restriction
endonuclease set
Different sets {Ez} were chosen with the following properties:
1. Enzymes that cut at least one of the sequences in the given
sequence data
2. Enzymes that cut 50% of the sequences in the given sequence
data
3. Enzymes that cut all the sequences at least once
4. Random enzyme set (consisting of a mixture from the sets listed
above)
5. Restriction enzymes commonly used in a biology laboratory
working with RFLP of fungi
6. Restriction enzyme set obtained using a genetic algorithm
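The coverage-based sets (properties 1 to 3) can be sketched as a filter on the fraction of sequences each enzyme cuts. The helper names, the toy sequences, and reading "cut 50% of the sequences" as "at least 50%" are assumptions, and only unambiguous recognition sites are handled here.

```python
def cut_fraction(site, sequences):
    """Fraction of sequences that contain the recognition site at least once."""
    return sum(site in seq for seq in sequences) / len(sequences)

def select_enzymes(enzymes, sequences, min_fraction):
    """Names of enzymes that cut at least `min_fraction` of the sequences."""
    return sorted(name for name, site in enzymes.items()
                  if cut_fraction(site, sequences) >= min_fraction)

sequences = ["AATCGATTGGCC", "GGCCAAAGCT", "TTGGCCTT", "AAAATTTT"]
enzymes = {"TaqI": "TCGA", "HaeIII": "GGCC", "AluI": "AGCT", "RsaI": "GTAC"}

cuts_any = select_enzymes(enzymes, sequences, 1 / len(sequences))
cuts_half = select_enzymes(enzymes, sequences, 0.5)
cuts_all = select_enzymes(enzymes, sequences, 1.0)
print(cuts_any, cuts_half, cuts_all)
```

In this toy data no enzyme cuts every sequence, so the third set is empty.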
Step 4 – Similarity Matrix based on LCS
score

Create a similarity matrix, or equivalently a complete
weighted graph, for each enzyme set {Ez}
– each node represents the enzyme cut order of a
sequence, and the weight of the edge between two nodes is the
similarity score (SS = LCS length) between
the two corresponding enzyme cut orders
– G_Ez = (K_n)_Ez = (V, E), where each v ∈ V is the
enzyme cut order of a sequence and each e ∈ E has weight |e| =
SS
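Building the matrix amounts to scoring every pair of enzyme cut orders. The sketch below is illustrative only: `similarity_matrix` and `lcs_length` are assumed names, the ECOs are toy lists of numeric enzyme ids, and the diagonal simply holds each ECO's own length.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence (standard dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity_matrix(ecos):
    """Symmetric matrix whose entry [i][j] is the LCS length of ECO i and ECO j."""
    n = len(ecos)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = lcs_length(ecos[i], ecos[j])
    return matrix

ecos = [[1, 2, 3, 1], [1, 3, 1], [2, 2, 4]]  # toy ECOs as numeric enzyme ids
matrix = similarity_matrix(ecos)
for row in matrix:
    print(row)
```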
Step 5 – Clustering
The similarity matrix is clustered and the resulting clusters are
analyzed for their phylogenetic accuracy

Clustering algorithms employed:
– Maximum gap based exclusive clustering
– Hierarchical clustering
– Similarity clustering
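As a rough illustration of clustering a similarity matrix, the sketch below groups sequences whose similarity reaches a threshold, which is equivalent to cutting a single-linkage hierarchy at that level. It is not the maximum gap based method named above, whose details are not given here; the function name, the threshold, and the toy matrix are assumptions.

```python
def single_linkage_clusters(matrix, threshold):
    """Group indices connected by similarity >= threshold (union-find)."""
    n = len(matrix)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

matrix = [
    [5, 4, 1, 0],
    [4, 5, 1, 1],
    [1, 1, 6, 5],
    [0, 1, 5, 6],
]
clusters = single_linkage_clusters(matrix, threshold=4)
print(clusters)  # [[0, 1], [2, 3]]
```

Lowering the threshold to 1 merges everything into a single cluster, which shows why the gap between within-group and between-group scores matters.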
Step 5 – Clustering


Sensitivity and positive predictive value were the two
important evaluation parameters for cluster analysis and
are defined as follows:
For a particular taxon in a group X
S = Sensitivity = TP / (TP + FN), where
TP = True Positive = count of the taxon's sequences in X
FN = False Negative = count of the taxon's sequences in DB1 that
are not in X
TP + FN = total count of the taxon's sequences in the entire DB1
(S)tax,X = count of the taxon in X / total count of the taxon
in the database

Similarly, for a particular taxon in a group X
PP = Positive Predictive Value = TP / (TP + FP), where
TP = True Positive = count of the taxon's sequences in X
FP = False Positive = count of sequences of other taxa that are in X
TP + FP = total count of sequences in the group X
(PP)tax,X = count of the taxon in X / total count of
sequences in X
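These two definitions translate directly into code. The sketch below assumes that a group and the database are each given as a list of taxon labels, one per sequence, and that the group is drawn from the database; the function name and the toy labels are illustrative.

```python
def sensitivity_and_ppv(taxon, group, database):
    """Sensitivity and positive predictive value of `taxon` in cluster `group`.

    `group` and `database` are lists of taxon labels, one per sequence;
    `group` is assumed to be a subset of `database`.
    """
    tp = group.count(taxon)          # taxon's sequences inside the group
    fn = database.count(taxon) - tp  # taxon's sequences outside the group
    fp = len(group) - tp             # other taxa's sequences inside the group
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv

database = ["Nectria"] * 3 + ["Lirula"] * 2 + ["Oligoporus"] * 2
group_x = ["Nectria", "Nectria", "Nectria", "Lirula"]
s, pp = sensitivity_and_ppv("Nectria", group_x, database)
print(s, pp)  # 1.0 0.75
```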
Step 6 – Genetic Algorithm





Find the optimal enzyme set for a particular dataset
using a genetic algorithm.
The optimal enzyme set is defined as the smallest
enzyme set that shows the highest phylogenetic
resolution
The fitness function is based on the expected
and actual count of an organism in the cluster.
The score is determined quantitatively in terms of
sensitivity and positive predictive value
Selection is by roulette-wheel selection,
tournament selection or random selection.
Uniform, single-point or two-point crossover is
used with a user-specified crossover rate.
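The overall shape of such a genetic algorithm can be sketched as follows, using tournament selection and uniform crossover from the options listed. This is a toy under stated assumptions: each individual is a list of booleans marking which candidate enzymes are in the set, elitism and bit-flip mutation are added for the sketch, and `toy_fitness` is a placeholder that rewards a hypothetical "useful" subset and penalizes extra enzymes. The real fitness would come from the sensitivity and positive predictive value of the resulting clusters.

```python
import random

def tournament(population, fitness, rng, k=3):
    """Pick the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)

def uniform_crossover(a, b, rng, rate):
    """With probability `rate`, take each gene from either parent at random."""
    if rng.random() >= rate:
        return a[:]
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(individual, rng, rate):
    """Flip each gene independently with probability `rate`."""
    return [not g if rng.random() < rate else g for g in individual]

def evolve(n_enzymes, fitness, generations=40, pop_size=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_enzymes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            child = uniform_crossover(p1, p2, rng, crossover_rate)
            children.append(mutate(child, rng, mutation_rate))
        population = children
        best = max(population, key=fitness)
    return best

# Placeholder fitness (assumption): enzymes 0, 3 and 5 are "useful",
# every other selected enzyme costs one point.
USEFUL = {0, 3, 5}

def toy_fitness(individual):
    chosen = {i for i, gene in enumerate(individual) if gene}
    return 2 * len(chosen & USEFUL) - len(chosen - USEFUL)

best = evolve(n_enzymes=10, fitness=toy_fitness)
print([i for i, gene in enumerate(best) if gene], toy_fitness(best))
```

The size penalty in the placeholder plays the role of the "smallest set" criterion: two sets with equal resolution are ranked by how few enzymes they use.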
Experiment -1

Sequence (Set-1)
– Type = Internal Transcribed Spacer
– Size = 7
– Taxonomy
Ascomycota = 5
– Nectria sp. = 3
– Lirula sp. = 2
Basidiomycota = 2
– Oligoporus sp. = 2

Enzyme (Set-1: TaqI, HaeIII, HinfI, AluI,
RsaI, MspI)
– Size = 6
– Property = Frequent cutter
Result-1
Result-1

All sequences are perfectly
clustered
– The similarity gap is narrow, as
reflected in the highlighted samples
Sample Test Set 1 – Enz Set 2
1. Used 57 enzymes on the same test set 1
2. Obtained a better similarity matrix (higher similarity gap)
3. A larger enzyme set may give better clustering results
4. All sequences are perfectly clustered
Result -2
21 of 26 species are perfectly clustered with 65 enzymes
Experiment -3
(Find Optimal Enzyme Set using GA)
Sequence (AspCan)
– Type = Internal Transcribed Spacer
– Size = 78
– Taxonomy – Aspergillus and Candida
Sequence (All9Genus)
– Type = Internal Transcribed Spacer
– Size = 97
– Taxonomy – 9 genera
Result -3
Conclusion





Restriction enzyme data can be modeled and used for
computational analysis.
Introduced a new property of DNA sequences based on the
order of cuts by multiple restriction enzymes on the
sequences, namely the Enzyme Cut Order.
This property can be quantified as a similarity score: the
length of the Longest Common Subsequence between two
enzyme cut orders.
The resulting similarity matrix shows high phylogenetic
resolution when clustered.
Can be considered an alternative “coarse-grain” method
for sequence identification and classification compared to
computationally intensive alignment methods