Customized siDesigner: A tool for designing siRNAs with functional off-target filtering 1 1 2 1,3 Shaoli Das , Suman Ghosal , Karol Kozak and Jayprokas Chakrabarti 1 Indian Association for the Cultivation of Science, Kolkata, West Bengal, 700032, India LMC-RISC, ETH Zurich, Zürich, CH-8093, Switzerland 3 Gyanxet, BF 286 Salt Lake, Kolkata, West Bengal, 700064, India 2 * To whom correspondence should be addressed. Tel: +91 33 24734971; Fax: +91 33 24732805; Email: j.chakrabarti@gyanxet.com Present Address: (Jayprokas Chakrabarti), Indian Association for the Cultivation of Science, Kolkata, West Bengal, 700032, India ABSTRACT Small interfering RNA (siRNA) design has evolved over time to address two main issues of designingpotency and specificity. The specificity issue can be handled by reducing the off-target effects arising from inadvertent targets. But the standard approaches towards such off-target minimization, however, have some limitations. Some of these off-targets may interfere with the biochemical pathway being investigated. Or, it may inadvertently target cell’s metabolic pathways with unquantifiable consequences on the processes of user’s interest. Here we report development of a siRNA selection tool that uses a multi-parametric classifier for optimized selection of potent siRNAs and implements a functional off-target filtering. The functional off-target filtering minimizes the number of off-target genes which are functionally related to the direct target gene, i.e. involved in a common biological process. A text mining algorithm is used to find related biological processes associated with the direct target and each off-target transcript. It also gives the user a choice to select one or more off-targets that may be potentially more harmful, from a predicted off-target gene list to be filtered out. This tool is the first in its kind that presents the user with such customized design to reach higher specificity. Keywords: siRNA potency, specificity, functional off-target filtering, user feedback, parameter weight optimization, multi-parametric classifier. INTRODUCTION The evolution of small interfering RNA (siRNA) mediated RNA interference (RNAi) have enabled scientists to rapidly examine the consequences of depleting a particular gene product in cells (1,3). In addition to that, siRNA have good potential in drug development for therapeutic purpose (2). But the success of the RNAi experiments depends primarily on the knockdown efficiency and specificity of the siRNAs used in these experiments. Designing of efficient siRNA must satisfy the target specificity issue to provide the most suitable choice of siRNAs. Now that, some progress has been made in improvement of silencing potential of siRNAs, main challenge in computational siRNA designing lies in the specificity issue of siRNAs. Exogenous siRNAs often induce off-target effects that 1 arise from near perfect or imperfect sequence complementarity with other mRNAs. The near perfect sequence complementarity results in cleavage of these mRNAs and the imperfect complementarity results in translational repression of these off-targets. The imperfect complementarity is mainly the complementarity of the 6-7 nucleotides from 5’ end (position 2-7 or 2-8 i.e. the seed region) of the siRNA anti-sense strand, with a part of 3’ UTR of other mRNAs. This resembles the RNA interference mechanism of endogenous microRNAs (miRNAs). Investigations have revealed that off-targeting can induce false positive phenotype during RNAi-based gene function study and traditional alignment algorithms used for finding sequence homology often fail to detect a huge number of potential offtargets which arises from partial sequence complementarity like miRNAs (33). The online siRNA designed solution that consider miRNA like off-targeting potential of synthetic siRNAs (20-25), handles this off-target minimization issue by choosing a siRNA with a seed region sequence (i.e. nucleotide position 2-7 or 2-8 from the 5’ end of the siRNA guide strand) that has fewer targets in other mRNAs’ 3’ UTR. But the off-target minimization measure taken by present day siRNA designing tools has certain demerit. It has been reported in siRNA mediated gene silencing experiments that the siRNAs used for studying a particular gene function actually targeted some of its downstream pathway components in microRNA like manner and hence gave misleading result. For e.g. in a siRNA screening experiment conducted for identifying novel members of the transforming growth factor (TGF)-b pathway in a human keratinocyte cell line, unwanted off-target effect was observed due to unintended silencing of two known upstream pathway components, the TGF-b receptors 1 and 2 (TGFBR1 and TGFBR2) (35). Clearly, to arrive at the right conclusion about an experiment, the siRNA user needs to know all the probable targets of the siRNA. Further, the user needs to know if the down regulation of some of the off-targets affects the result of the specific gene silencing experiments being undertaken. If the siRNA choice inadvertently affects the pathway being studied due to off-target effects, a new siRNA needs to be chosen. Our newly developed siRNA designing tool Customized siDesigner comes with the following standard features to address this problem- 1) It presents the user all the probable off-targets of any chosen siRNA; 2) It chooses siRNAs that have minimum number of functionally related off-targets and alerts the user on presence of such off-targets, and 3) It interfaces with the user to arrive at a siRNA, through a recursive process, that has no or minimal off-target effect on the process being studied. The potency of siRNAs is highly dependent upon the selection of the region it targets. A displacement of 5-6 nucleotides position hugely alters the efficiency of siRNA. The reason behind this alteration of efficiency is the alteration in local sequence and structural features of the target region. These variations in sequence and structural features correlate with target accessibility, RISC loading or stimulation of immune response. A lot of study has been done in the field of rational siRNA designing to find appropriate design parameters that facilitate designing of effective siRNAs. Several guidelines for rational siRNA designing are provided till date (4, 5) and many siRNA commercial suppliers (12-25) provide online customized siRNA design solutions using their own set of guidelines (6-11). After so many studies regarding parameters for selecting efficient siRNAs, some improvements have been achieved in targeting success rate compared to the early cases of gene 2 knockdown experiments using siRNAs. But still there is no reliable guideline for optimized weight distribution of siRNA selection parameters. Here a machine learning algorithm like support vector machine or artificial neural network can serve excellent purpose, when trained with biologically validated siRNA data sets. Customized siDesigner combines a support vector machine classifier enabled potent siRNA selection method with a unique functional off-target filtering that improves target specificity. It enables improved selection of potent siRNAs by application of a Support Vector machine based optimization of a set of common siRNA selection parameters, as well as facilitates improved specificity by a functional off-target filtering approach that minimizes off-targeting functionally related transcripts. The functional off-target filtering approach, which finds off-target transcripts associated with related biological process (Gene ontology terms) as that of the direct target, avoids silencing of genes which may have similar phenotype as the direct target. A feedback mechanism (through a graphical user interface) allows the user to choose specific genes that needs to be filtered out from the list of predicted off-target genes recursively until his/her needs are met. This technique gives a more rational approach towards improving potency and specificity in siRNA design and handling the miRNA like off-target issue. MATERIAL AND METHODS We have evaluated present day siRNA design rules such as- low GC content, absence of long stretches of identical nucleotides, thermodynamic conditions favouring efficient RISC entry, absence of homology to other mRNAs, absence of immune stimulatory motifs in the RISC entering strand of the siRNA duplex and some position specific nucleotide compositions, in a database of validated siRNAs (36) used in experiments to examine the threshold parameters. A support vector machine classifier, trained with the optimal features set, using supervised learning is used for classifying potential and effective siRNAs (explained below). The tool also gives user some flexibility for choosing a set of stringent parameters in time of selecting potential siRNAs. Moreover, with other parameters, it uses an off-target prediction algorithm that predicts the microRNA like off-targets for each potential siRNA candidate and uses a text mining approach to find the off-targets that are functionally related with the direct target gene in terms of related GO Process. For finding transcripts with related GO process terms, a preposition tagger based text mining algorithm is used which is simple yet very useful approach. siRNAs which have a large number of functionally related off-targets are filtered out. It also has a user feedback control that enables user to filter out some off-target genes, so that they could not be off-targeted by the designed siRNAs. Fig 1 shows the work-flow of our siRNA designing algorithm. siRNA selection parameters Our siRNA selection method is implemented using a set of rules assembled from several guidelines used in different on line siRNA selection tools and siRNA design companies. These rules can be 3 divided into three categories 1) Selection of target site, 2) Thermodynamic Consideration and 3) Sequence Characteristics. Selection of target starting site The target site should be inside the ORF of the target mRNA for efficient silencing. Generally it is advised to start choosing target sites from 50-100 nucleotides downstream of the AUG start codon. We started selection from 50 nucleotides downstream of the start codon. Selection of target sites starting with AA di-nucleotide According to well established experimented observations, siRNAs with a UU overhang in their 3’ end has improved potency. We chose target sites starting with AA nucleotides. Avoiding target sites having polymorphic locus Target regions having single nucleotide polymorphism (SNP) sites was avoided. Informations about SNP sites within mRNAs are collected from Refseq RNA database. Consideration for alternative splicing: In eukaryotic organisms frequent alternative splicing results in diversification of mRNAs and finally of proteins. To account for the alternative splicing, our design chooses target sites common for all alternatively spliced isoforms. Thermodynamic property for efficient RISC loading: In an siRNA duplex, antisense strand with relatively low energy in 5’ end is favourable for its loading into RISC complex. So, we chose siRNA duplexes for which difference in binding energy between the 5’ end of the sense and antisense strand is negative. Position specific nucleotide composition: Several sequence analysis of effective siRNAs revealed many position specific nucleotide compositions for enhancing potency of the siRNA. Some of these preferences are listed below: -Presence of A base at position 1 of the sense strand. -Presence of A or U base at position 18 of the sense strand. -Presence of C base at position 3 of the sense strand, -A base other than G at position 16 of the sense strand. -No occurrences of successive four or more identical nucleotides- like four A’s or four T’s. Sequence feature for efficient RISC entry: siRNA guide strands with relatively low 5’ end energy are favoured for entering the RISC complex. So, presence of at least three (A/U)s in the seven nucleotides at the 3' end of the sense strand is preferable. siRNA duplex stability: Target sites with low G/C content (generally less than 60%) has a greater potential for being functional siRNA site, as too high G/C content can impede the loading of siRNAs into RISC complex. Also, too low G/C content is not favourable because too low G/C content can destabilize the siRNA duplex and reduce their affinity to target mRNA binding. Analysis of effective siRNAs showed G/C content between 35% and 60% is most favourable. Table 1 lists all the parameter sets used in Customized siDesigner. Minimizing off-target effects 4 Depending on its nature, off-target effects induced by the siRNA can be divided into three categories (26) as discussed below. Near perfect complementarity with other mRNAs: mRNAs other than intended targets which exhibit near perfect sequence complementarity with the siRNA are likely to be degraded by the siRNA. This kind of off-targets is minimized by choosing targets sites that do not have 16 or more consecutive base homology with any other mRNA. We calculated homology using SmithWaterman algorithm. miRNA like off-target effect: We predicted off-targets by finding seed complements in other mRNA’s 3’UTR and seeing the conservation of the target site among closely related species. The conserved regions within 3’ UTRs of human mRNAs are collected from UTRdb (37). The seed region of the siRNA is considered to be 7 mer, position 2-8 of the antisense strand. We considered transcripts with perfect 7 mer seed complementarity as well as one base mismatch tolerance (position 7 or 8) in the seed region with 3’ compensatory complementarity (at position 13-18). We have also considered seed complement frequency (SCF) that is how many times the 7 mer seed region finds complementarity in the 3’ UTR of a transcript. For two or more 3' UTR matches, the probability of silencing increases significantly. So, we considered SCF greater than or equal to 2 for predicting off-targets. Stimulation of innate immune response: siRNAs can induce unwanted toxic effects by activating innate immune system. Exogenous siRNAs are prone to be recognized by Toll-like receptors (TLRs), mainly TLR7, TLR8 and TLR9. TLR7 and TLR8 recognize synthetic siRNAs in a sequence dependent manner (28). There seems to be preferential recognition of GU-rich sequences. So we considered selecting siRNA target sites lacking GU rich regions. Also presence of the motif “GUCCUUCAA” in the siRNA RISC entering strand is known to be immune stimulatory (29). So, these motifs are avoided in the time of designing of siRNAs. Classification of efficient siRNAs To classify between potential siRNA and non-potential candidates, a SVM based machine learning approach is used. The support vector machine is trained with the above mentioned feature set of both potential siRNA and non-potential siRNA samples. The set of 200 highly efficient and 200 poorly efficient siRNA candidates is collected from siRecords, a database of validated siRNAs (36). The support vector machine is trained using a Gaussian kernel and Sequential Minimal Optimization (SMO) algorithm developed by John C. Platt (30). We used SVM-JAVA, the Java implementation of SVM by X. Jiang and H. Yu (31). For quantifying the efficiency of the classified siRNAs, we used the score output from SVM; for present case that is: ............................ (1) where K is a kernel function (here Gaussian function) that measures the distance between the input vector x (siRNA selection parameter set) and the stored training vector xj (siRNA selection parameter set for siRNAs with known efficiency), yj is the known output for input xj, αj is Lagrangian multiplier for xj and b is the bias; α and b are optimized by the SVM. 5 Training the SVM classifier For training and testing the classifier, a total of 400 validated positive and negative examples are collected from siRecords database (data in supplementary file). For positive samples, a set of 200 highly efficient validated siRNAs targeting different genes is collected from siRNA database (34). The 200 negative samples are also collected from this database, the set marked to have low efficiency. The support vector machine is trained with total 300 samples, 150 samples from each class (Data in supplementary table 1), using a Gaussian kernel with parameter sigma equal to 0.01. Functional off-target filtering by minimizing off-targeting functionally related transcripts The functional off-target filtering approach basically filters out siRNA candidates for which the number of functionally correlated off-target genes (that is the predicted off-target genes that share same or related biological process, i.e. Gene Ontology (GO) Process terms exceeds the pre-defined threshold. We collected the GO process terms associated with each human gene from NCBI gene2go mapping database (32). Basically, after predicting a gene as off-target our algorithm compares its GO Process terms with that of the direct target gene. To find related GO terms it uses a simple yet very useful approach- which is preposition tagging. This approach is more efficient in GO mining than the traditional n-gram (i.e. bigram, trigram etc.) approach for text mining. For GO mining, our objective is to find the most important part of the whole GO Process term. The traditional bigram/trigram approach fails here as often it is difficult to say whether the first two/three words or the last two/three words will be more important. So, we used a preposition tagger- which tags the prepositions appearing in the GO Process term. Depending on the type of preposition, it then divides the whole term to two sub parts- before and after the preposition and compares them separately for each set of GO terms. If one term matches another or is contained in the other, then we consider the pair of GO terms similar and treat the respective pair of genes (the direct target and an off-target) as functionally related. Fig. 2 illustrates the text mining approach used for finding related biological processes. We put a threshold of 5 for functionally correlated off-target genes- i.e. any candidate siRNA with more than five functionally correlated off-targets will be filtered out. User feedback approach for off-target filtering After classifying potential siRNA candidates which is displayed to the user, the user can select a siRNA to view its predicted off-target genes. Among the list of off-targets, he can select the off-targets he wants to be filtered out. The program runs again to select potential siRNA candidates that are not predicted to have the user selected off-targets. One can run the program again and again until the required off-target filtering conditions are met. Thus, the off-target filtering by user feedback is accomplished. Implementation 6 We developed Customized siDesigner in Java programming language (version 1.6) with a Swing user interface using Netbeans IDE. Our siRNA designing tool takes as input the target gene symbol and returns most potential siRNA candidates using a set of siRNA design rules. We used the current version of RefSeq RNA database (27) to collect human mRNA sequences. For implementation of the SVM classifier, we used SVM-JAVA, the Java implementation of SVM by X. Jiang and H. Yu (31). RESULTS Testing with four different data sources For evaluation of our algorithm in terms of selection of effective siRNAs, we tested our system with 4 datasets from different sources. We measured effectiveness of the classifier by 1) classification accuracy- that depict how well it recognises the true class of the data (effective or non effective), 2) sensitivity- that depicts how well it recognises the true positive data, i.e. efficient siRNAs, 3) specificity- that depicts how well it recognises the true negative data, i.e. poorly efficient siRNAs. The classification accuracy, specificity and sensitivity of the classifier are calculated based on the following formulaACC = TP+TN/TP+FN+FP+TN (1) SN = TP/TP+FN (2) SP = TN/TN+FP (3) Where, ACC is classification accuracy, SN is sensitivity, SP is specificity, TP is total number of true positives, FN is number of false negatives, TN is number of true negatives and FP is number of false positives. Testing with siRecords data. The trained classifier is tested with test set from two different sourcessiRecords and Huesken data. Test set from siRecords database comprise of total 100 samples- 50 positive samples and 50 negative samples (Data in supplementary table 1). The classification accuracy is 67% for 100 test samples while the sensitivity and specificity are 60% and 74% respectively. Testing with Huesken data. The Huesken dataset (38) is a set of experimentally knockdown validated siRNAs of length 21 nucleotides. The siRNAs in the Huesken data samples are annotated with their respective normalized knockdown efficiency. This dataset comprise of 2430 total siRNAs. We have categorized efficient and non-efficient siRNAs from this dataset based on a cut-off for their knockdown efficiency (knockdown>0.5, knockdown>0.6, knockdown>0.7 and knockdown>0.8 for efficient siRNAs and <0.4 for non-efficient siRNAs). The sensitivity increases significantly with the increasing threshold for positive samples, i.e. the siRNAs with knockdown efficiency>0.7 and 0.8 are predicted more accurately than siRNAs with knockdown efficiency>0.6 and 0.5 (72.508% for knockdown>0.5, 7 74.001% for knockdown>0.6, 75.752% for knockdown>0.7 and 79.886% for knockdown>0.8) (see fig. 3a). This result shows that our algorithm can identify highly efficient siRNAs more accurately than siRNAs with moderate efficiency. Specificity improved significantly for high scoring top ranked siRNAs (based on algorithm score), but at the expense of decreased sensitivity. The specificity gradually increased for top 30%, top 20% and top 10% siRNAs respectively (see fig. 3b). We achieved a high specificity of 87.92% when considered knockdown efficiency>0.6 as efficient siRNAs. We observed significant improvement in specificity for top 30 siRNAs based on the algorithm score. Within the 30 top scorers, 29 siRNAs (96.66%) have knockdown efficiency>0.6 and only one siRNA have knockdown efficiency<0.4. Furthermore, 25 siRNAs (83.33%) have knockdown efficiency>0.7 and 20 siRNAs (66.66%) have knockdown efficiency>0.75. To compare our algorithm strength with BioPredsi algorithm (which used artificial neural network classifier for classification of effective siRNAs), we tested with the same test set as BioPredsi; which is a 249 siRNA set. Considering knockdown efficiency cutoff as 0.66 for efficient siRNAs, our algorithm achieved a higher specificity (93.2%) than BioPredsi (83.3%) for top 10% of the classified siRNAs. However, in terms of sensitivity BioPredsi worked better. But in case of siRNA selection, a greater specificity is more desirable than a greater sensitivity so that the chance of picking nonfunctional siRNAs is minimized. Testing with MIT siRNA database. We have further tested Cusomized siDesigner on the MIT database of experimentally validated 143 siRNAs (39), with known mRNA knockdown levels ranging from 25-100%, for 104 different human genes. siRNAs designed by Customized siDesigner had 68.5% overlap with siRNAs from this database and at least one siRNA overlap for 68 genes (Data in supplementary table 2). Testing with genome wide siRNA library. siRNAs designed by Customized siDesigner for human genes in genome wide scale are checked with genome wide siRNA library from Qiagen that has been used in Biogenesis screening experiment (40). The biogenesis project is dealing with ribosomes, which are macromolecular complexes used to synthesize proteins. The numbers of control wells at each plate was equal: 16 control wells (positive/negative) on first two columns and 16 control wells (positive/negative) on last two plates columns in experiment (in total 9472 wells). On each plate, Allstars siRNA, Qiagen were used as negative controls and Crm1 siRNA as positive controls. The Qiagen library contains 76009 non overlapping siRNAs for 19600 target human genes. The oligos designed by Customized siDesigner has overall 23% overlap with the Qiagen library, and 6779 genes share at least one common oligo included in both libraries. The overlap increased (28.83%) when we restricted the genes to most efficient siRNAs from Qiagen library (Data in supplementary table 3). We also checked for overlap of oligos in the Qiagen library that target the 1000 genes playing significant role in biogenesis. Out of these 1000 genes, 652 genes share at least one common oligo with Qiagen 8 library and Customized siDesigner, among which 609 genes share common siRNA with measured signal in biogenesis screen. Reducing off-target effects Functional off-target filtering. The functional off target filtering allows avoiding transcripts with related GO Process from being off targeted. Table 4 shows some siRNAs designed for the gene MMP11 by five different online siRNA designing tool with their predicted off targets that have related GO Process with the direct target. We evaluated the performance of our text mining algorithm that has been used for finding similar or related GO biological process terms by checking parent-child relationships from the full gene ontology file (41) downloaded from the Gene Ontology website (42). The ontology file contains 22773 GO biological process terms with their similar or related GO terms (related by “is a” or “part of” relationship). Our text mining algorithm identified 55.715% of the related terms. Table 2 shows some siRNAs designed for the gene MMP11 by five different online siRNA designing tool with their predicted off targets that have related GO Process with the direct target. We checked siRNAs from MIT siRNA database (siRNAs designed by Qiagen) for siRNAs with such off-targets that are involved in similar or related biological process as the direct target. Data is shown in the supplementary table 4. User feedback for customized specificity. An user interface is built using Java Swing for ease of use. The user interface enables user to provide the target gene symbol and presents the list of most potent siRNAs. On selecting a particular siRNA from the list, the user can view it’s predicted off target gene list from which he may select one or more off targets to be filtered out. Fig 4 shows a screenshot of the tool and usage of the feedback process. DISCUSSION The two main issues in siRNA selection are potency and specificity of the siRNA. After two decades of research for efficient siRNA selection parameters, today siRNA targeting success rate has improved significantly compared to earlier times of siRNA mediated gene silencing experiments. Still a large number of designed siRNAs within the siRNA libraries used in screening experiments are non functional and yields poor knockdown result. Also improvement in specificity, i.e. reducing the offtarget effect arising from silencing inadvertent targets is another challenge that has not been resolved efficiently till date. Our newly developed siRNA designing tool Customized siDesigner aims to improve both potency and specificity of siRNA designing by applying a SVM based classification of potent siRNAs using a combined set of siRNA selection parameters and using a unique functional off-target filtering for improved specificity. Several siRNA sequence selection algorithms have been developed in the past decade that relies on intrinsic sequence, stability and target accessibility features of functional siRNAs. Still these algorithms often fail to predict highly functional siRNAs. A good optimization of the siRNA selection parameters and their weights is needed for successful prediction of potent siRNAs. Some methods have been suggested for optimal weight distribution among siRNA selection parameters like rational 9 or weighted methods. Machine learning algorithms like artificial neural network also have been used for classification of efficient siRNAs (38). Here we used a Support vector machine based powerful optimization with a non linear Gaussian kernel that is very suitable for classification of highly functional siRNAs from non functional ones. Our system has been tested with huge number of experimentally validated data samples from four different sources and gave sufficiently good result. For improved specificity and reduced off-target effects of the designed siRNAs, we developed a new algorithm to eliminate functionally related transcripts from being off-targeted. Till date, different siRNA designing tools have developed their own proprietary algorithms for siRNA designing, but still minimization of miRNA like off-targeting effect is rarely considered. Few tools that have considered the minimization of miRNA like off-targeting effect rely on reducing the number of such off-targets as it seems unavoidable to retain some off-target, at least by computational design. Some designing tools rely on choosing siRNAs with low seed complement frequency as it is obvious that siRNAs having a seed region that falls into the low frequency group will have fewer off-targets (34). But then in many cases, presence of such seed sequences can decrease the potency of the siRNA because they often contain stretches of identical nucleotides or other features unfavourable for a potent siRNA. So, there is a trade-off between potency and specificity of a siRNA which have to be dealt with in their computational designing. Ideally the number of off-targets needs to be zero, but such a goal is known to be unachievable. What is possible to achieve is to design a siRNA that eliminates unintended off-targets that may have potential harmful effect. But, for this to happen, the user has to interface with the design process to customize the siRNA. Customized siDesigner is conceptualized to deliver such tailor-made siRNAs. Often in siRNA screening experiments, it has been reported that the desired output is affected because of silencing of unintended off-targets that sometimes are themselves member of the upstream pathway components of the direct target gene (35). In such cases it can be useful to avoid specifically some off-targets that can cause more harmful or inconvenient undesirable effects. It should be considered that during silencing process siRNA should not silence any mRNA from the same pathway the target mRNA is part of. In case of investigation of a gene function, if any gene from the same pathway is silenced rather than target gene then it will be difficult to investigate the actual phenotype of silencing the gene under investigation. For e.g. in a siRNA screening experiment conducted for identification of novel members of the transforming growth factor (TGF)-b pathway in a human keratinocyte cell line, unwanted off-target effect was observed because of unintended silencing of two known upstream pathway components, the TGF-b receptors 1 and 2 (TGFBR1 and TGFBR2) (33). Such off-target silencing activity poses threats of confusing and misleading results. Also the siRNAs suggested by the online siRNA selection tools often are predicted to have off-targets that belong to the same pathway or somehow related to the direct target. Customized siDesigner is a siRNA selection tool with such functional off-target filtering that reduces the number of functionally related off-targets by comparing their associated GO Process term by a text mining algorithm. Our simple yet effective text mining algorithm enables identification of not only identical but also related GO biological process terms. In future we are aiming to apply a more refined text mining algorithm for 10 identifying hierarchically related GO process terms. Our tool also gives the user a choice to filter out the undesired off-targets. User can select a siRNA from a list of most potent candidates to see the predicted off-target gene list from where he may select one or more off-targets to be filtered out. This tool is the first in its kind that presents the user with such choices for customized specificity. In coming days, this kind of functional off-target filtering of siRNAs may prove to be beneficial for use in cultured cells to study gene functions. AVAILABILITY Customized siDesigner is freely available for download from http://gyanxet-beta.com. SUPPLEMENTARY DATA Supplementary Tables 1-4 lists the training and testing siRNA datasets from siRecords, Common siRNAs found between MIT siRNA database and siRNAs designed by Customized siDesigner for 68 genes, Common siRNAs from Qiagen siRNA genome wide library having a hit score greater than 0.3 in the biogenesis screening project and siRNAs designed by Customized siDesigner, and siRNAs from MIT siRNA database predicted to have functionally related off-targets, respectively. ACKNOWLEDGEMENT We express our gratitude to Prof. Urlike Kutay from Department of Biochemistry, ETH Zurich for providing siRNA screening data from Biogenesis project. We also thank Sanga Mitra and Smarajit Das from Indian Association for the Cultivation of Science for their helpful suggestions about this paper. FUNDING This work was supported by Department of Science and Technology, Government of India. REFERENCES 1. Grosshans H. and Filipowicz W., (2008) The expanding world of small RNAs. Nature, 451, 414–416. 2. Fougerolles A. D., Vornlocher H.P, Maraganore J. and Lieberman J., (2007) Interfering with disease: a progress report on siRNA-based therapeutics. Nat. Rev. Drug Discovery, 6, 443– 453. 3. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T, (2001) Duplexes of 21nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature, 411, 494– 498. 4. Pei Y. and Tuschl T., (2006) On the art of identifying effective and specific siRNAs. Nat. Methods, 3, 670–676. 11 5. Reynolds, A., D. Leake, Q. Boese, S. Scaringe, W. S. Marshall, and A. Khorova, (2004) Rational siRNA design for RNA interference. Nat. Biotechnology, 22, 326-330. 6. [http://www.rnaiweb.com/RNAi/siRNA_Design/index.html] 7. Wang L. and Forest Y. M., (2004) A Web-based design center for vector-based siRNA and siRNA cassette. Bioinformatics, 20, 1818–1820. 8. Holen T., (2006) Efficient prediction of siRNAs with siRNArules 1.0: An open-source JAVA approach to siRNA algorithms. Bioinformatics, 12, 1620–1625. 9. Yuan B., Latek R., Hossbach M., Tusch T. and Lewitter F., (2004) siRNA Selection Server: an automated siRNA oligonucleotide prediction server. Nucleic Acids Res., 32, 130–134. 10. Chalk A. M, Wahlestedt C., and Sonnhammer E. L.L., (2004) Improved and automated prediction of effective siRNA. Biochem. Biophys. Res. Commun., 319, 264–274. 11. Park Y. K., Park S. M., Choim Y. C., Lee D., Won M. and Kim Y. J., (2008) AsiDesigner: exon-based siRNA design server considering alternative splicing. Nucleic Acids Res., 36, 97– 103. 12. AsiDesigner [http://sysbio.kribb.re.kr:8080/AsiDesigner/menuDesigner.jsf]. 13. Ambion siRNA Target Finder [http://www.ambion.com/techlib/misc/siRNA_finder.html]. 14. MicroSynth siRNA design [http://www.microsynth.ch/499.0.html]. 15. IDT RNAi Design [http://www.idtdna.com/Scitools/Applications/RNAi/RNAi.aspx]. 16. Imgenex siRNA tool [http://www.imgenex.com/sirna_tool.php]. 17. BLOCK-iT RNAi Designer [https://rnaidesigner.invitrogen.com/rnaiexpress]. 18. siSearch [http://sonnhammer.cgb.ki.se/siSearch/siSearch_1.7.html]. 19. Promega siRNA Target Designer [http://www.promega.com/siRNADesigner/program/]. 20. siDESIGN Center [http://www.dharmacon.com/sidesign]. 21. siRNA Target Finder [https://www.genscript.com/ssl-bin/app/rnai]. 22. Whitehead WI siRNA Selection Program [http://jura.wi.mit.edu/bioc/siRNAext/]. 23. BIOPREDsi [http://www.biopredsi.org/]. 24. siDirect [http://sidirect2.rnai.jp]. 25. siDRM [http://sidrm.biolead.org/] 26. Jackson A.L, Linsley P.S, (2010) Recognizing and avoiding siRNA off-target effects for target identification and therapeutic application. Nat. Rev. Drug Discov., 9, 57-67. 27. Human Refseq RNA database. [ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.gbff.gz] 28. Judge D., Sood V., Shaw J.R., Fang D., McClintock K. and MacLachlan I., (2005) Sequencedependent stimulation of the mammalian innateimmune response by synthetic siRNA. Nat. Biotechnol, 23, 457-462. 29. Fedorov Y, Anderson E.M, Birmingham A, Reynolds A, Karpilow J, Robinson K, Leake D, Marshall W.S, Khvorova A., (2006) Off-target effects by siRNA can induce toxic phenotype. RNA, 12, 1188-1196. 12 30. Platt J. C., (1999) Fast training of support vector machines using sequential minimal optimization, Advances in kernel methods. MIT Press Cambridge, pp 185 – 208. 31. Jiang X. and Yu H., (2008) SVM-JAVA: A Java implementation of the SMO (Sequential Minimal Optimization) for training SVM. [http://dm.postech.ac.kr/svm-java]. 32. NCBI gene2go mapping database. [ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz]. 33. Birmingham A, Anderson E.M, Reynolds A, Ilsley-Tyree D, Leake D, Fedorov Y, Baskerville S, Maksimova E, Robinson K, Karpilow J, Marshall W.S, Khvorova A, (2006) 3' UTR seed matches, but not overall identity, are associated with RNAi off-targets. Nat. Methods, 3, 199204. 34. Anderson E. M., A. Birmingham, S. Baskerville, A. Reynolds, E.Maksimova, D. Leake, Y. Fedorov, J. Karpilow and A. Khvorova, (2008) Experimental validation of the importance of seed complement frequency to siRNA specificity. RNA, 14, 853-861. 35. Schultz N, Marenstein D.R, De Angelis D.A, Wang W.Q, Nelander S, Jacobsen A, Marks D.S, Massagué J, Sander C., (2011) Off-target effects dominate a large-scale RNAi screen for modulators of the TGF-β pathway and reveal microRNA regulation of TGFBR2. Silence, 2. 36. Y. Ren, W. Gong, Q. Xu, X. Zheng, D. Lin, Y. Wang and T. Li, (2006) siRecords: an extensive database of mammalian siRNAs with efficacy ratings. Bioinformatics, 22, 1027-1028. 37. Grillo G, Turi A, Licciulli F, Mignone F, Liuni S, Banfi S, Gennarino VA, Horner DS, Pavesi G, Picardi E, Pesole G., (2010) UTRdb and UTRsite (RELEASE 2010): a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res., 38, 75-80. 38. Huesken D, Lange J, Mickanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S, Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J., (2005) Design of a genomewide siRNA library using an artificial neural network. Nat. Biotechnol, 23, 995-1001. 39. MIT siRNA database [http://web.mit.edu/sirna/sirnas-human.html]. 40. Wild T, Horvath P, Wyler E, Widmann B, Badertscher L, Zemp I, Kozak K, Csusc G, Lund E, Kutay U, (2010) A protein inventory of human ribosome biogenesis reveals an essential function of Exportin 5 in 60S subunit export. PLoS Biol, 8:e1000522. 41. Full gene ontology file, release date 06:03:2012, version 1.2, cvs version 1.2681. 42. Gene Ontology Consortium, (2001) Creating the gene ontology resource: design and implementation. Genome Res., 11, 1425-33. TABLE AND FIGURES LEGENDS Table 1. Detailed parameter set of Customized siDesigner. Stringent Parameters: (These parameters cannot be adjusted by the user) Target starting site: Within ORF region- 50nt downstream of the start codon AUG 13 Within 3’ UTR- 50 nt downstream of the end of CDS The 21 nt target region is common for all alternatively spliced isoforms of the target gene Alternative splicing: Adjustable Parameters: (User can select or deselect these parameters to apply or not apply these rules stringently) Target starts with: AA di nucleotide GC content: 30% to 60% Thermodynamic difference of the sense and antisense strand 5’ end: Sense strand 5’ end (bases 1-6 of sense strand) binding energy- antisense strand 5’ end (bases 14-19) binding energy<0 (binding energy is calculated by base stacking energy using nearest neighbour algorithm) Presence of stimulatory motif: immune 1) No presence of “GUCCUUCAA” in the antisense strand (which is RISC entering) 2) GU content<65% SNP sites within the target region: No SNP sites within the target region Homology mRNAs: No more than 16 contiguous base homology with non-target mRNAs with other miRNA like off-targets: Number of miRNA like off targets less than 100 Functional filtering: Number of off-target genes having a related GO Process term with the target gene<5 off-target Parameters for training the SVM classifier GC content of the 19 nt core region of the target site (in %) GU content of the 19 nt core region of the target site (in %) Presence of a stretch of 4 identical nucleotides (1 or 0) Presence of immune stimulatory motif GUCCUUCAA in the antisense strand (1 or 0) Thermodynamic difference of the sense and antisense strand 5’ end 14 Number of A/U in the 14th-19th position in the sense strand (in %) Number of G/C in the 1st-6th position in the sense strand (in %) Number of presence of position specific nucleotide features - Presence of A at Position 1 + Presence of A or U at position 18 + Presence of C at position 8 + Presence of any base other than G at position 16 Table 2. siRNAs designed for the gene MMP11 by five different online siRNA designing tool with their predicted off targets that have related GO Process with the direct target siRNA selection tool Predicted off siRNA sense strand Related GO Process target siDesign centre GCACACAACAGCAGCCAAG CASP14 proteolysis SHROOM4 multicellular (Dharmacon) organismal development BiopredSi (Qiagen) CCTCCAAAGCCATTGTAAA VANGL2 multicellular organismal development SATB2 multicellular organismal development IKZF1 multicellular organismal development SPRED1 multicellular organismal development 15 WI siRNA selection AATGTGTGTACAGTGTGTATA PAFAH1B1 multicellular server (Whitehead organismal institute) development Genscript siRNA target PCSK2 proteolysis AAGGTATGGAGCGATGTGACG JOSD1 proteolysis GTTGTTTTCTAAAAAAAAAAA LTBP4 Regulation of finder (Genscript) siDirect proteolysis DISC1 multicellular organismal development EBF2 multicellular organismal development GSK3B multicellular organismal development 16 Figure 1. Workflow of the siRNA designing algorithm. Figure 2. Illustration of the text mining algorithm used for finding related GO biological process terms 17 Fig 3. (a) Sensitivity and specificity of the classifier for different threshold of knockdown efficiency in Huesken dataset. The sensitivity increases for higher knockdown efficiency score (0.4>0.5>0.6>0.7>0.8) while the specificity decreases. (b) Increase in specificity for top scoring siRNA rated by the algorithm (threshold for knockdown efficiency kept as 0.6). Specificity increases greatly for top 30%, 20% and 10% siRNAs. 18 Fig 4. A screenshot of Customized siDesigner user interface 19