Classification of real and pseudo microRNA
precursors using local structure-sequence
features and support vector machine
Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li,
and Xuegong Zhang
CISC 841 Bioinformatics
Nehar
1
Background: miRNAs





Single-stranded RNAs, ~20-25 nucleotides long, that play a
regulatory role in gene expression.
Transcribed as a long primary miRNA (pri-miRNA) with a hairpin
structure.
The pri-miRNA is processed by the nuclear RNase III Drosha into a
~60-70 nt long pre-miRNA.
The pre-miRNA is actively transported from the nucleus to the
cytoplasm by Exportin-5.
There it is cleaved into the ~20-25 nt mature miRNA.
2
Background: The ‘hairpin loop’

A sequence of nucleotides in which two segments can form
base pairs with each other, but the segment between them
cannot.
3
Background: The ‘hairpin loop’
The sequence
---CCTGCXXXXXXXGCAGG---
forms the hairpin structure
---C G---
   C G
   T A
   G C
   C G
  X   X
  X   X
  X   X
    X
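The pairing in the example above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper names `reverse_complement` and `can_form_stem` are hypothetical, the stem length of 5 is read off the example, and only Watson-Crick pairs are considered.

```python
# Watson-Crick complements for the DNA letters used in the example sequence.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(segment: str) -> str:
    """Return the segment read backwards with every base complemented."""
    return "".join(COMPLEMENT[base] for base in reversed(segment))


def can_form_stem(five_prime: str, three_prime: str) -> bool:
    """True if the two segments can pair base by base to form a stem."""
    return three_prime == reverse_complement(five_prime)


sequence = "CCTGCXXXXXXXGCAGG"
stem_len = 5  # assumption: taken from the example, not computed

left = sequence[:stem_len]            # 5' arm of the stem
loop = sequence[stem_len:-stem_len]   # unpaired segment in the middle
right = sequence[-stem_len:]          # 3' arm of the stem

print(left, loop, right, can_form_stem(left, right))
```

The two arms pair because the 3' arm is the reverse complement of the 5' arm, while the X run in between has no partner and is left as the loop.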
4
Background: The ‘hairpin loop’

The pre-miRNA 'hairpin' is an important secondary
structure for identifying miRNAs.

Since mature miRNAs are very short (~20 nt), sequence
alignment is not very useful for identification of
miRNAs.

The solution is to make use of the hairpin structure of the pre-miRNA.
5
The problem 




Many sequence segments fold into similar stem-loop hairpin
structures.
So existing methods for identifying miRNAs must use
comparative genomics information in addition to structural
features. Example: filtering out hairpins that are not
conserved in related species.
This implies an inability to identify miRNAs without close
known homologues.
Furthermore, for species with no closely related species
sequenced, comparative genomics approaches cannot be
applied.
6
Proposed solution 



Ab initio (from first principles) classification of real pre-miRNAs versus "pseudo" pre-miRNAs, i.e. non-pre-miRNA
sequences that have the hairpin structure.
Derive a set of novel features that combine local structure
and sequence information of pre-miRNA stem-loops.
Use an SVM to classify hairpins as pre-miRNA or pseudo pre-miRNA.
7
The datasets




Sets of human pre-miRNA and pseudo-miRNA hairpins were
collected to train the SVM and evaluate its performance.
Human pre-miRNAs were downloaded from the miRNA registry
database. Only pre-miRNAs without multiple loops were
considered (193, or ~93% of the database).
Pseudo and candidate miRNA hairpins: segments with a
stem-loop structure similar to pre-miRNAs that are not pre-miRNAs.
These come from the CODING dataset and the CONSERVED-HAIRPIN
dataset.
8
The Coding dataset




Collected from protein coding regions.
Used as negative samples in training and validation of
classifier.
Length distribution kept identical to pre-miRNAs.
Criteria for selection:



Minimum of 18 base pairings on the stem of the hairpin.
Maximum free energy of the secondary structure of -15 kcal/mol.
(These numbers correspond to the limits for genuine human pre-miRNAs.)
8,494 pre-miRNA-like hairpins in this dataset.
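The two selection criteria above can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the dot-bracket structure and free energy are assumed to come from an RNA folding program, which the slides do not name.

```python
def count_base_pairs(dot_bracket: str) -> int:
    """Each "(" in dot-bracket notation opens exactly one base pair."""
    return dot_bracket.count("(")


def is_pre_mirna_like(dot_bracket: str, free_energy: float,
                      min_pairs: int = 18,
                      max_energy: float = -15.0) -> bool:
    """Apply the two thresholds listed on the slide.

    free_energy is in kcal/mol; more negative means more stable, so a
    "maximum of -15" keeps structures at or below that value.
    """
    enough_pairs = count_base_pairs(dot_bracket) >= min_pairs
    stable_enough = free_energy <= max_energy
    return enough_pairs and stable_enough


# Made-up inputs for illustration only (not taken from the Coding dataset).
good = "(" * 20 + "." * 6 + ")" * 20
short_stem = "(" * 10 + "." * 6 + ")" * 10

print(is_pre_mirna_like(good, -25.3))        # passes both thresholds
print(is_pre_mirna_like(short_stem, -25.3))  # too few base pairs
print(is_pre_mirna_like(good, -10.0))        # not stable enough
```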
9
The Conserved-hairpin dataset




Extracted from positions 56,000,001 to 57,000,000 of
human chromosome 19 (UCSC database).
Used as a candidate dataset to evaluate the classifier.
2,444 hairpins from sequences conserved between human
and mouse.
Most hairpins are likely to be pseudo-miRNAs; in fact,
there are only 3 known miRNAs in this dataset.
10
Training and Test sets


For the classification experiments, one training set and two test
sets were built from the 3 datasets.
TR-C: Training set.
163 human pre-miRNAs (positive samples) from the 193 human pre-miRNAs.
168 pseudo pre-miRNAs (negative samples) from the Coding dataset.
TE-C: Test set 1.
Remaining 30 human pre-miRNAs; 1000 pseudo pre-miRNAs
(avoiding those in TR-C).
Conserved-hairpin dataset: Test set 2.
11
Two further test sets


Apply the SVM trained using previous sets on two further
test sets.
Cross-Species test set


581 pre-miRNAs from 11 species.
Updated test set


New batch of reported human miRNA.
Includes 39 non-redundant pre-miRNAs without multiple loops.
12
Local contiguous structure-sequence
features





Local sequence features are important in pre-miRNAs.
The authors claim that the distribution of local sub-structures (i.e.
continuously paired or unpaired structures) in pre-miRNAs is significantly distinct from that in pseudo pre-miRNAs.
Use a combination of local structure and sequence
information to classify real vs. pseudo miRNA hairpins.
Focus on the information in 3 adjacent nucleotides (triplet
elements).
“(” and “)” mean paired at the 5’-end and the 3’-end; “.” means
unpaired. The paper does not distinguish “(” from “)” in the features, so both are written as “(”.
13
Structure-sequence features


8 possible structure
compositions for each
triplet [“(((”, “((.”, “(..”,
and so on].
32 (= 4 nucleotides {A, C, G, U} x 8)
structure-sequence
combinations if we also
consider the middle nt.
14
Structure-sequence features


e.g. “U(((” => the middle nt is
U and all three nts are
paired.
Count the appearances of
each triplet element to get a 32-dimensional feature
vector (normalized).
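The counting step described above can be sketched as follows. This is a rough illustration, not the authors' implementation: the function name `triplet_features` is hypothetical, the normalization (dividing by the total count) is an assumption because the slides only say "normalized", and the sketch slides over the whole sequence rather than any particular region of the hairpin.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# 8 structure triplets: "(((", "((.", "(.(", ... , "..."
STRUCTURES = ["".join(p) for p in product("(.", repeat=3)]
# 32 features such as "U(((": middle nucleotide followed by the structure triplet
FEATURES = [nt + s for nt in NUCLEOTIDES for s in STRUCTURES]


def triplet_features(sequence: str, dot_bracket: str) -> list[float]:
    """Return the 32-dimensional normalized triplet frequency vector."""
    if len(sequence) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    # No 5'/3' distinction: ")" is treated the same as "(".
    structure = dot_bracket.replace(")", "(")
    counts = dict.fromkeys(FEATURES, 0)
    for i in range(1, len(sequence) - 1):
        key = sequence[i] + structure[i - 1:i + 2]
        if key in counts:  # skips nucleotides outside A, C, G, U
            counts[key] += 1
    total = sum(counts.values())
    # Assumed normalization: divide each count by the total number of triplets.
    return [counts[f] / total if total else 0.0 for f in FEATURES]


# Toy hairpin for illustration only.
vec = triplet_features("GGGAAACCC", "(((...)))")
for name, value in zip(FEATURES, vec):
    if value > 0:
        print(name, round(value, 3))
```

For the toy hairpin there are seven triplets, each occurring once, so every non-zero entry equals 1/7.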
15
SVM Classification



The SVM classifier is trained with TR-C and applied to the
other test sets.
On TE-C, 28/30 human pre-miRNAs and 881/1000
pseudo-miRNAs are correctly identified.
On the Conserved-hairpin set, 2174/2444 structures are classified
as pseudo miRNAs.
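As a quick check of the counts reported above, they can be turned into rates. The variable names are illustrative; reading 28/30 as sensitivity and 881/1000 as specificity follows the usual definitions rather than wording on the slide.

```python
def rate(correct: int, total: int) -> float:
    """Fraction of samples that were classified correctly."""
    return correct / total


sensitivity = rate(28, 30)         # real pre-miRNAs recognized as real
specificity = rate(881, 1000)      # pseudo hairpins recognized as pseudo
conserved_rate = rate(2174, 2444)  # Conserved-hairpin set called pseudo

print(f"sensitivity:        {sensitivity:.1%}")     # 93.3%
print(f"specificity:        {specificity:.1%}")     # 88.1%
print(f"conserved as pseudo: {conserved_rate:.1%}")  # 89.0%
```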
16
SVM Classification



The triplet elements reflect contiguous fine-structures and
sequence composition. For instance, “(((” => stacking of
paired bases, and “...” => bulge loops.
The success of the classifier shows that these features
reflect intrinsic characteristics of pre-miRNAs.
“(((” appears at a higher frequency in pre-miRNAs, and
“...” appears more often in pseudo miRNAs.
17
SVM Classification
Average freq. of triplets in training dataset
18
SVM Classification




These observations can be linked to the stability of the secondary
structure. Stacking more continuously paired nts
lowers the free energy, so pre-miRNAs are more stable.
19
SVM Classification

Sequence information

The frequency of the same triplet structure with different middle nts
varies within real pre-miRNAs, and between real and pseudo miRNAs.
20
SVM Classification across species



The classifier trained on human data was applied to other
species (Cross-Species test set).
Good performance in identifying true pre-miRNAs:
90.9% overall accuracy on 581 known pre-miRNAs from
11 species.
22
Conclusion





Ab initio methods for distinguishing true pre-miRNAs
from pre-miRNA-like hairpin structures are very
important.
The triplet-SVM classifier captures fine-grained sequence-structure characteristics.
90% accuracy on human data.
Up to 90% accuracy on 11 other species (including plants
and virus) without using comparative genomics
information.
The current specificity of about 89% is not enough for
genome-wide applications.
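A rough back-of-the-envelope calculation with the Conserved-hairpin numbers from earlier in the talk illustrates why this specificity falls short. It is a sketch under a stated assumption: the best case in which all 3 known miRNAs in that set are among the hairpins predicted as real, which the slides do not state. Predicted hairpins that are not known miRNAs could also include undiscovered ones.

```python
total_hairpins = 2444     # hairpins in the 1 Mb Conserved-hairpin region
classified_pseudo = 2174  # hairpins the classifier rejected
known_mirnas = 3          # known miRNAs in that region

# Everything not rejected is reported as a candidate pre-miRNA.
predicted_real = total_hairpins - classified_pseudo

# Assumption (best case): every known miRNA is among the predictions.
best_case_known_fraction = known_mirnas / predicted_real

print(predicted_real)                      # 270
print(f"{best_case_known_fraction:.1%}")   # 1.1%
```

Even in this best case, only about 1 in 90 predictions in a single 1 Mb region would correspond to a known miRNA, so candidates scale up quickly across a whole genome.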
24