Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine
Chenghai Xue, Fei Li, Tao He, Guo-Ping Liu, Yanda Li, and Xuegong Zhang
CISC 841 Bioinformatics
Nehar

Background: miRNAs
Single-stranded RNAs, ~20-25 nucleotides long, that play a regulatory role in gene expression.
Transcribed as a long primary miRNA (pri-miRNA) with a hairpin structure.
The pri-miRNA is processed by the nuclear RNase III Drosha into a ~60-70 nt pre-miRNA.
The pre-miRNA is actively transported from the nucleus to the cytoplasm by Exportin-5.
It is then cleaved into the ~20-25 nt mature miRNA.

Background: The ‘hairpin loop’
A sequence of nucleotides in which two segments can form base pairs with each other, but a segment between them cannot.
The sequence
  ---CCTGCXXXXXXXGCAGG---
forms the hairpin structure
  ---C G---
     C G
     T A
     G C
     C G
    X   X
    X   X
    X   X
      X
The pre-miRNA 'hairpin' is an important secondary structure for identifying miRNAs.
Since mature miRNAs are very short (~20 nt), sequence alignment is not very useful for identifying them.
The solution is to make use of the hairpin structure of the pre-miRNA.

The problem
Many sequence segments fold into similar stem-loop hairpin structures, so existing methods for identifying miRNAs must use comparative genomics information in addition to structural features.
An example: filter out hairpins that are not conserved in related species.
This implies an inability to identify miRNAs without close known homologues.
Furthermore, comparative genomics approaches can't be applied to a species that has no closely related species sequenced.

Proposed solution
Ab initio (from first principles) classification of real pre-miRNAs versus "pseudo" pre-miRNAs, i.e. non-pre-miRNA sequences that have the hairpin structure.
Get a set of novel features that combine local structure and sequence information of pre-miRNA stem-loops.
Use an SVM to classify hairpins as real pre-miRNA or pseudo pre-miRNA.

The datasets
Sets of human pre-miRNA and pseudo-miRNA hairpins were collected to train the SVM and evaluate its performance.
Human pre-miRNAs: downloaded from the miRNA registry database.
  Only pre-miRNAs without multiple loops were considered (193, or about 93% of the database).
Pseudo and candidate miRNA hairpins: segments with a stem-loop structure similar to pre-miRNA that are not pre-miRNAs.
  These come from the CODING dataset and the CONSERVED-HAIRPIN dataset.

The Coding dataset
Collected from protein-coding regions.
Used as negative samples in training and validation of the classifier.
Length distribution kept identical to that of the pre-miRNAs.
Criteria for selection:
  Minimum of 18 base pairings on the stem of the hairpin.
  Maximum free energy of the secondary structure of -15 kcal/mol.
  (These numbers correspond to the limits for genuine human pre-miRNAs.)
8,494 pre-miRNA-like hairpins in this dataset.

The Conserved-hairpin dataset
Extracted from the genome region at positions 56,000,001 to 57,000,000 on human chromosome 19 (UCSC database).
Used as a candidate dataset to evaluate the classifier.
2,444 hairpins from sequences conserved between human and mouse.
Most hairpins are likely to be pseudo-miRNAs; in fact, there are only 3 known miRNAs in this dataset.

Training and Test sets
For the classification experiments, one training set and two test sets were built from the 3 datasets.
TR-C: Training set.
  163 human pre-miRNAs (+ve samples) from the 193 human pre-miRNAs.
  168 pseudo pre-miRNAs (-ve samples) from the Coding dataset.
TE-C: Test set 1.
  The remaining 30 human pre-miRNAs.
  1000 pseudo pre-miRNAs (avoiding those in TR-C).
Conserved-hairpin dataset: Test set 2.

Two further test sets
The SVM trained on the previous sets was applied to two further test sets.
Cross-Species test set: 581 pre-miRNAs from 11 species.
Updated test set: a new batch of reported human miRNAs.
Includes 39 non-redundant pre-miRNAs without multiple loops.

Local contiguous structure-sequence features
Local sequence features are important in pre-miRNAs.
The authors claim that the distribution of local sub-structures (i.e. continuously paired or unpaired structures) in pre-miRNAs is significantly distinct from that in pseudo pre-miRNAs.
Use a combination of local structure and sequence information to classify real vs. pseudo miRNA hairpins.
Focus on the information in 3 adjacent nucleotides (triplet elements).
“(” and “)” mean paired, on the 5' arm and the 3' arm respectively; “.” means unpaired.
The paper doesn't make the 5'/3' distinction, so “(” is used for both.

Structure-sequence features
8 possible structure compositions for each triplet: “(((”, “((.”, “(..”, and so on.
32 structure-sequence combinations (4 nucleotides U, C, G, A x 8 structures) if we consider the middle nt.
E.g. “U(((” means the middle nt is U and all three nts are paired.
Count the appearances of each triplet element to get a 32-dimensional feature vector (normalized).

SVM Classification
The SVM classifier is trained with TR-C and applied to the other test sets.
On TE-C, 28/30 human pre-miRNAs and 881/1000 pseudo-miRNAs are correctly identified.
On the Conserved-hairpin set, 2174/2444 structures are classified as false miRNAs.
The triplet elements reflect contiguous fine-structures and sequence composition.
For instance, “(((” => stacking of paired bases, and “...” => bulge loops.
The success of the classifier shows that these features reflect intrinsic characteristics of pre-miRNAs.
“(((” appears at higher frequency in pre-miRNAs, and “...” appears more often in pseudo miRNAs.
[Figure: average frequency of triplets in the training dataset]
These observations can be linked to the stability of the secondary structure.
Stacking of more continuously paired nts decreases the free energy, so pre-miRNAs are more stable.
Sequence information: the frequency of the same triplet structure with different middle nts varies within real pre-miRNAs, and between real and pseudo miRNAs.

SVM Classification across species
Applied the classifier trained on human data to other species (Cross-Species test set).
Pretty good performance in identifying true pre-miRNAs: 90.9% overall accuracy on 581 known pre-miRNAs from 11 species.

Conclusion
Ab initio methods for distinguishing true pre-miRNAs from pre-miRNA-like hairpin structures are very important.
The triplet-SVM classifier describes fine-grained sequence-structure characteristics.
90% accuracy on human data.
Up to 90% accuracy on 11 other species (including plants and a virus) without using comparative genomics information.
The current specificity of about 89% is not enough for genome-wide applications.
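
The triplet encoding from the "Structure-sequence features" slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `triplet_features`, the sliding window over the whole hairpin, and normalization by the total number of windows are all assumptions, and the secondary structure is assumed to be supplied as a dot-bracket string from some folding tool.

```python
from itertools import product

NUCLEOTIDES = "ACGU"
# The 8 structure triplets; "(" means paired (either arm), "." means unpaired.
STRUCT_TRIPLETS = ["".join(p) for p in product("(.", repeat=3)]
# The 32 feature labels: middle nucleotide + structure triplet, e.g. "U(((".
FEATURES = [nt + s for nt in NUCLEOTIDES for s in STRUCT_TRIPLETS]


def triplet_features(seq, dot_bracket):
    """Map each of the 32 labels to its frequency in the hairpin.

    Hypothetical helper: windowing and normalization are assumptions.
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    seq = seq.upper().replace("T", "U")
    # Collapse ")" into "(" so the 5' and 3' arms are not distinguished.
    struct = dot_bracket.replace(")", "(")
    counts = dict.fromkeys(FEATURES, 0)
    total = 0
    # Slide a window of 3 positions; the label uses the middle nucleotide.
    for i in range(1, len(seq) - 1):
        label = seq[i] + struct[i - 1:i + 2]
        if label in counts:  # skip windows with a non-ACGU middle base
            counts[label] += 1
            total += 1
    if total == 0:
        return {label: 0.0 for label in FEATURES}
    return {label: n / total for label, n in counts.items()}


# Toy hairpin: the CCTGC...GCAGG stem from the slides, with a made-up
# loop of seven A's standing in for the unspecified X's.
example = triplet_features("CCUGCAAAAAAAGCAGG", "(((((.......)))))")
for label, freq in example.items():
    if freq > 0:
        print(label, round(freq, 3))
```

In this toy hairpin, five of the 15 windows lie entirely inside the loop, so “A...” is the most frequent element. The resulting 32 values, taken in the order of `FEATURES`, would form the input vector for an SVM.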