
Customized siDesigner: A tool for designing siRNAs with
functional off-target filtering
Shaoli Das 1, Suman Ghosal 1, Karol Kozak 2 and Jayprokas Chakrabarti 1,3,*
1 Indian Association for the Cultivation of Science, Kolkata, West Bengal, 700032, India
2 LMC-RISC, ETH Zurich, Zürich, CH-8093, Switzerland
3 Gyanxet, BF 286 Salt Lake, Kolkata, West Bengal, 700064, India
* To whom correspondence should be addressed. Tel: +91 33 24734971; Fax: +91 33 24732805; Email:
j.chakrabarti@gyanxet.com
Present Address: (Jayprokas Chakrabarti), Indian Association for the Cultivation of Science, Kolkata,
West Bengal, 700032, India
ABSTRACT
Small interfering RNA (siRNA) design has evolved over time to address two main issues: potency and
specificity. Specificity can be improved by reducing the off-target effects that arise from inadvertent
targets. The standard approaches to off-target minimization, however, have some limitations. Some
off-targets may interfere with the biochemical pathway being investigated, or a siRNA may
inadvertently target the cell's metabolic pathways, with unquantifiable consequences for the
processes of interest to the user. Here we report the development of a siRNA selection
tool that uses a multi-parametric classifier for optimized selection of potent siRNAs and implements
functional off-target filtering. Functional off-target filtering minimizes the number of off-target
genes that are functionally related to the direct target gene, i.e. involved in a common biological
process. A text mining algorithm is used to find related biological processes associated with the direct
target and each off-target transcript. The tool also lets the user select, from the predicted off-target
gene list, one or more potentially more harmful off-targets to be filtered out. This tool is
the first of its kind to offer the user such customized design for higher specificity.
Keywords: siRNA potency, specificity, functional off-target filtering, user feedback, parameter weight
optimization, multi-parametric classifier.
INTRODUCTION
The evolution of small interfering RNA (siRNA) mediated RNA interference (RNAi) has
enabled scientists to rapidly examine the consequences of depleting a particular gene product in cells
(1,3). In addition, siRNAs have good potential in drug development for therapeutic purposes (2).
The success of RNAi experiments, however, depends primarily on the knockdown efficiency and
specificity of the siRNAs used. The design of an efficient siRNA must also address
target specificity to provide the most suitable choice of siRNAs. Now that some progress has
been made in improving the silencing potential of siRNAs, the main challenge in computational siRNA
design lies in specificity. Exogenous siRNAs often induce off-target effects that
arise from near-perfect or imperfect sequence complementarity with other mRNAs. Near-perfect
sequence complementarity results in cleavage of these mRNAs, and imperfect complementarity
results in translational repression of these off-targets. The imperfect complementarity is mainly the
complementarity of the 6-7 nucleotides at the 5' end of the siRNA antisense strand (positions 2-7 or
2-8, i.e. the seed region) with part of the 3' UTR of other mRNAs. This resembles the RNA interference
mechanism of endogenous microRNAs (miRNAs). Investigations have revealed that off-targeting can
induce false positive phenotypes in RNAi-based gene function studies, and that traditional alignment
algorithms used for finding sequence homology often fail to detect a large number of potential
off-targets that arise from partial, miRNA-like sequence complementarity (33). The online siRNA
design solutions that consider the miRNA-like off-targeting potential of synthetic siRNAs (20-25)
handle off-target minimization by choosing a siRNA whose seed region sequence (i.e.
nucleotide positions 2-7 or 2-8 from the 5' end of the siRNA guide strand) has fewer targets in
the 3' UTRs of other mRNAs. But this off-target minimization measure, as taken by present-day siRNA
design tools, has a drawback. It has been reported in siRNA-mediated gene silencing experiments that the
siRNAs used for studying a particular gene function actually targeted some of its downstream
pathway components in a microRNA-like manner and hence gave misleading results. For example, in a siRNA
screening experiment conducted to identify novel members of the transforming growth factor
(TGF)-β pathway in a human keratinocyte cell line, an unwanted off-target effect was observed due to
unintended silencing of two known upstream pathway components, the TGF-β receptors 1 and 2
(TGFBR1 and TGFBR2) (35). Clearly, to arrive at the right conclusion about an experiment, the siRNA
user needs to know all the probable targets of the siRNA. Further, the user needs to know whether
down-regulation of some of the off-targets affects the result of the specific gene silencing experiment being
undertaken. If the siRNA choice inadvertently affects the pathway being studied through off-target
effects, a new siRNA needs to be chosen. Our newly developed siRNA design tool, Customized
siDesigner, comes with the following standard features to address this problem: 1) it presents the user
with all the probable off-targets of any chosen siRNA; 2) it chooses siRNAs that have a minimum number of
functionally related off-targets and alerts the user to the presence of such off-targets; and 3) it interfaces
with the user to arrive, through an iterative process, at a siRNA that has no or minimal off-target effect
on the process being studied.
The potency of a siRNA is highly dependent on the region it targets. A
displacement of 5-6 nucleotide positions can greatly alter the efficiency of a siRNA, because it
changes the local sequence and structural features of the target region.
These variations in sequence and structural features correlate with target accessibility, RISC loading
or stimulation of the immune response. Many studies in the field of rational siRNA
design have sought appropriate design parameters for effective siRNAs. Several
guidelines for rational siRNA design have been published to date (4, 5), and many commercial siRNA
suppliers (12-25) provide online customized siRNA design solutions using their own sets of guidelines
(6-11). As a result of these studies on parameters for selecting efficient siRNAs, some
improvement has been achieved in targeting success rate compared to the early cases of gene
knockdown experiments using siRNAs. But there is still no reliable guideline for optimized weight
distribution among siRNA selection parameters. Here a machine learning algorithm such as a support vector
machine or an artificial neural network can serve the purpose well when trained with biologically
validated siRNA data sets.
Customized siDesigner combines a support vector machine (SVM) classifier for potent siRNA
selection with a functional off-target filter that improves target specificity. It
improves the selection of potent siRNAs by SVM-based optimization
of a set of common siRNA selection parameters, and it improves specificity through a
functional off-target filtering approach that minimizes off-targeting of functionally related transcripts. The
functional off-target filtering approach finds off-target transcripts associated with biological
processes (Gene Ontology terms) related to those of the direct target, and so avoids silencing genes that
may have a phenotype similar to that of the direct target. A feedback mechanism (through a graphical user
interface) allows the user to choose specific genes to be filtered out from the list of
predicted off-target genes, repeatedly, until his/her needs are met. This technique gives a more
rational approach to improving potency and specificity in siRNA design and to handling the miRNA-like
off-target issue.
MATERIAL AND METHODS
We evaluated present-day siRNA design rules, such as low GC content, absence of long
stretches of identical nucleotides, thermodynamic conditions favouring efficient RISC entry, absence
of homology to other mRNAs, absence of immune stimulatory motifs in the RISC-entering strand of
the siRNA duplex and some position-specific nucleotide compositions, in a database of validated
siRNAs (36) used in experiments, to examine the threshold parameters. A support vector machine
classifier, trained on the optimal feature set by supervised learning, is used to classify
potential and effective siRNAs (explained below). The tool also gives the user some flexibility to choose
a more stringent set of parameters when selecting potential siRNAs. In addition to these parameters, it
uses an off-target prediction algorithm that predicts the microRNA-like off-targets of each potential
siRNA candidate, and a text mining approach to find the off-targets that are functionally related
to the direct target gene in terms of related GO Process terms. To find transcripts with related GO
Process terms, a preposition-tagger-based text mining algorithm is used, which is a simple yet
useful approach. siRNAs that have a large number of functionally related off-targets are filtered out.
The tool also has a user feedback control that lets the user filter out chosen off-target genes, so that they
are not off-targeted by the designed siRNAs. Fig. 1 shows the workflow of our siRNA design
algorithm.
siRNA selection parameters
Our siRNA selection method is implemented using a set of rules assembled from several guidelines
used by different online siRNA selection tools and siRNA design companies. These rules can be
divided into three categories: 1) selection of target site, 2) thermodynamic considerations and 3)
sequence characteristics.
Selection of target starting site: The target site should be inside the ORF of the target mRNA for
efficient silencing. It is generally advised to start choosing target sites 50-100 nucleotides
downstream of the AUG start codon. We started selection 50 nucleotides downstream of the
start codon.
Selection of target sites starting with AA di-nucleotide: According to well-established experimental
observations, siRNAs with a UU overhang at their 3' end have improved potency. We chose target
sites starting with AA nucleotides.
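The two site-selection rules above can be illustrated with a minimal Python sketch. It is not the tool's Java implementation: the function name, the 0-based coordinates and the 21-nucleotide window (AA followed by 19 nucleotides) are assumptions made for illustration, since the text does not state the target-site length or how the 50-nucleotide offset is counted.

```python
def candidate_sites(mrna, cds_start, cds_end, offset=50, site_len=21):
    """Return (position, site) pairs for ORF sites that start with AA.

    mrna      -- transcript sequence, 5' to 3' (A/C/G/T or A/C/G/U)
    cds_start -- 0-based index of the A of the AUG start codon
    cds_end   -- 0-based index one past the end of the ORF
    offset    -- nucleotides skipped downstream of the start codon (50 in the text)
    site_len  -- ASSUMED site length (AA + 19 nt); not given in the text
    """
    seq = mrna.upper().replace("U", "T")
    sites = []
    # Slide a window over the ORF, starting `offset` nt after the start codon.
    for pos in range(cds_start + offset, cds_end - site_len + 1):
        site = seq[pos:pos + site_len]
        if site.startswith("AA"):
            sites.append((pos, site))
    return sites
```

Each returned site would then be passed to the remaining filters (SNP, isoform, thermodynamic and sequence rules).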
Avoiding target sites having polymorphic loci: Target regions containing single nucleotide polymorphism
(SNP) sites were avoided. Information about SNP sites within mRNAs was collected from the RefSeq RNA
database.
Consideration for alternative splicing: In eukaryotic organisms, frequent alternative splicing results in
diversification of mRNAs and ultimately of proteins. To account for alternative splicing, our design
chooses target sites common to all alternatively spliced isoforms.
Thermodynamic property for efficient RISC loading: In a siRNA duplex, an antisense strand with
relatively low energy at its 5' end is favourable for loading into the RISC complex. So, we chose siRNA
duplexes for which the difference in binding energy between the 5' ends of the sense and antisense strands
is negative.
Position specific nucleotide composition: Several sequence analyses of effective siRNAs have revealed
many position-specific nucleotide compositions that enhance siRNA potency. Some of these
preferences are listed below:
-Presence of an A base at position 1 of the sense strand.
-Presence of an A or U base at position 18 of the sense strand.
-Presence of a C base at position 3 of the sense strand.
-A base other than G at position 16 of the sense strand.
-No occurrence of four or more successive identical nucleotides, such as four A's or four T's.
Sequence feature for efficient RISC entry: siRNA guide strands with relatively low 5' end energy are
favoured for entering the RISC complex. So, the presence of at least three (A/U)s in the seven
nucleotides at the 3' end of the sense strand is preferable.
siRNA duplex stability: Target sites with low G/C content (generally less than 60%) have a greater
potential for being functional siRNA sites, as too high a G/C content can impede the loading of siRNAs
into the RISC complex. Too low a G/C content is also unfavourable because it can
destabilize the siRNA duplex and reduce its affinity for the target mRNA. Analysis of effective
siRNAs showed that a G/C content between 35% and 60% is most favourable.
Table 1 lists all the parameter sets used in Customized siDesigner.
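The sequence-based rules listed above translate directly into boolean features of the kind a classifier could take as input. The following Python sketch is an illustration only; the feature names and the assumption of a sense strand of at least 18 nucleotides are ours, and the thermodynamic, SNP and isoform rules are not covered.

```python
import re

def sequence_features(sense):
    """Evaluate the sequence rules above on a sense strand given 5' to 3'."""
    s = sense.upper().replace("T", "U")
    gc = 100.0 * sum(base in "GC" for base in s) / len(s)
    return {
        "A_at_1": s[0] == "A",                       # A at position 1
        "AU_at_18": s[17] in "AU",                   # A or U at position 18
        "C_at_3": s[2] == "C",                       # C at position 3
        "not_G_at_16": s[15] != "G",                 # any base but G at position 16
        "no_run_of_4": re.search(r"(.)\1{3}", s) is None,
        "AU_rich_3prime": sum(base in "AU" for base in s[-7:]) >= 3,
        "gc_35_to_60": 35.0 <= gc <= 60.0,
    }
```

For instance, the hypothetical 19-mer AGCUGACCUGAAGCUCAUU satisfies every rule (G/C content of 9/19, about 47%), whereas a G/C-only sequence with runs of four identical bases fails both the run and the G/C checks.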
Minimizing off-target effects
Depending on their nature, the off-target effects induced by a siRNA can be divided into three categories
(26), as discussed below.
Near perfect complementarity with other mRNAs: mRNAs other than the intended target that exhibit
near-perfect sequence complementarity with the siRNA are likely to be degraded by the siRNA. This
kind of off-target is minimized by choosing target sites that do not share 16 or more consecutive
bases of homology with any other mRNA. We calculated homology using the Smith-Waterman algorithm.
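The 16-consecutive-base criterion can be illustrated without a full local alignment. The Python sketch below swaps in a plain longest-common-substring scan for the Smith-Waterman algorithm used in the tool, so it detects only ungapped, mismatch-free stretches; the function names are hypothetical.

```python
def longest_shared_run(site, transcript):
    """Length of the longest contiguous stretch shared by site and transcript."""
    best = 0
    # prev[j] holds the length of the common run ending at site[i-2], transcript[j-1].
    prev = [0] * (len(transcript) + 1)
    for i in range(1, len(site) + 1):
        cur = [0] * (len(transcript) + 1)
        for j in range(1, len(transcript) + 1):
            if site[i - 1] == transcript[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def has_near_perfect_match(site, transcript, min_run=16):
    """True when the site shares at least min_run consecutive bases with the transcript."""
    return longest_shared_run(site, transcript) >= min_run
```

A candidate target site would be rejected as soon as this check succeeds for any mRNA other than the intended target.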
miRNA like off-target effect: We predicted off-targets by finding seed complements in the 3' UTRs of
other mRNAs and checking the conservation of the target site among closely related species. The conserved
regions within the 3' UTRs of human mRNAs were collected from UTRdb (37). The seed region of the
siRNA is taken to be the 7-mer at positions 2-8 of the antisense strand. We considered transcripts with
perfect 7-mer seed complementarity, as well as those with a one-base mismatch (at position 7 or 8) in the
seed region accompanied by 3' compensatory complementarity (at positions 13-18). We also considered the seed
complement frequency (SCF), that is, how many times the 7-mer seed region finds complementarity in
the 3' UTR of a transcript. For two or more 3' UTR matches, the probability of silencing increases
significantly. So, we required an SCF greater than or equal to 2 for predicting off-targets.
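The seed complement frequency described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it counts perfect 7-mer matches only, and omits the conservation check and the one-mismatch rule with 3' compensatory pairing; the function names and gene labels are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed_site(antisense):
    """Return the 3' UTR motif complementary to the 7-mer seed (positions 2-8)."""
    guide = antisense.upper().replace("T", "U")
    seed = guide[1:8]
    # Reverse complement, so the motif reads 5' to 3' on the target mRNA.
    return seed.translate(COMPLEMENT)[::-1]

def seed_complement_frequency(antisense, utr):
    """Count perfect seed matches (overlaps allowed) in one 3' UTR."""
    motif = seed_site(antisense)
    target = utr.upper().replace("T", "U")
    count, start = 0, target.find(motif)
    while start != -1:
        count += 1
        start = target.find(motif, start + 1)
    return count

def predict_off_targets(antisense, utrs, min_scf=2):
    """Return {gene: SCF} for every 3' UTR whose SCF reaches the cut-off."""
    hits = {}
    for gene, utr in utrs.items():
        scf = seed_complement_frequency(antisense, utr)
        if scf >= min_scf:
            hits[gene] = scf
    return hits
```

With the default cut-off of 2, a transcript carrying a single seed match is not reported, mirroring the SCF rule stated in the text.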
Stimulation of innate immune response: siRNAs can induce unwanted toxic effects by activating the
innate immune system. Exogenous siRNAs are prone to recognition by Toll-like receptors (TLRs),
mainly TLR7, TLR8 and TLR9. TLR7 and TLR8 recognize synthetic siRNAs in a sequence-dependent
manner (28), and there seems to be preferential recognition of GU-rich sequences. So we
selected siRNA target sites lacking GU-rich regions. The presence of the motif “GUCCUUCAA” in the
RISC-entering strand of the siRNA is also known to be immune stimulatory (29). Such motifs are therefore
avoided when designing siRNAs.
Classification of efficient siRNAs
To distinguish potent siRNA candidates from non-potent ones, an SVM-based machine learning
approach is used. The support vector machine is trained with the above-mentioned feature set computed for
both potent and non-potent siRNA samples. A set of 200 highly efficient and 200 poorly
efficient siRNA candidates was collected from siRecords, a database of validated siRNAs (36). The
support vector machine is trained using a Gaussian kernel and the Sequential Minimal Optimization
(SMO) algorithm developed by John C. Platt (30). We used SVM-JAVA, the Java implementation of
SVM by X. Jiang and H. Yu (31). To quantify the efficiency of the classified siRNAs, we used the
score output by the SVM, which in the present case is:
$$f(x) = \sum_{j} y_j \alpha_j K(x_j, x) - b \qquad (1)$$
where K is a kernel function (here the Gaussian function) that measures the distance between the input
vector x (siRNA selection parameter set) and the stored training vector x_j (siRNA selection parameter
set for siRNAs with known efficiency), y_j is the known output for input x_j, α_j is the Lagrange multiplier for
x_j and b is the bias; α and b are optimized by the SVM.
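As an illustration of how such a score is evaluated, the Python sketch below computes the kernel expansion for one input vector. Two points are assumptions rather than facts from the text: the Gaussian kernel is parametrised as exp(-||x - x_j||^2 / (2 sigma^2)), and b is subtracted, following the convention of Platt's SMO formulation (other SVM implementations add the bias instead). The function names are hypothetical and are not the SVM-JAVA API.

```python
import math

def gaussian_kernel(x, xj, sigma):
    """ASSUMED form: exp(-||x - xj||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def svm_score(x, training_vectors, labels, alphas, b, sigma):
    """Score = sum_j y_j * alpha_j * K(x_j, x) - b (Platt-style sign for b)."""
    total = sum(y * a * gaussian_kernel(x, xj, sigma)
                for xj, y, a in zip(training_vectors, labels, alphas))
    return total - b
```

A positive score places the candidate on the side of the efficient class, and its magnitude can be used to rank candidates.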
Training the SVM classifier: For training and testing the classifier, a total of 400 validated positive and
negative examples were collected from the siRecords database (data in supplementary file). For positive
samples, a set of 200 highly efficient validated siRNAs targeting different genes was collected from the
siRNA database (34). The 200 negative samples, marked as having low efficiency, were also collected from
this database. The support vector machine was trained with a total of 300 samples, 150
samples from each class (data in supplementary table 1), using a Gaussian kernel with parameter
sigma equal to 0.01.
Functional off-target filtering by minimizing off-targeting functionally related transcripts
The functional off-target filtering approach filters out siRNA candidates for which the number
of functionally correlated off-target genes (that is, predicted off-target genes that share the same or a
related biological process, i.e. Gene Ontology (GO) Process terms, with the direct target) exceeds a pre-defined threshold.
We collected the GO Process terms associated with each human gene from the NCBI gene2go mapping
database (32). After predicting a gene as an off-target, our algorithm compares its GO Process
terms with those of the direct target gene. To find related GO terms it uses a simple yet useful
approach, preposition tagging. This approach is more efficient in GO mining than the
traditional n-gram (i.e. bigram, trigram etc.) approach to text mining. In GO mining, our objective is
to find the most important part of the whole GO Process term. The traditional bigram/trigram approach
fails here because it is often difficult to say whether the first two/three words or the last two/three words
are more important. So, we used a preposition tagger, which tags the prepositions appearing in the
GO Process term. Depending on the type of preposition, it then divides the whole term into two sub-parts,
before and after the preposition, and compares them separately for each set of GO terms. If one
term matches another or is contained in the other, we consider the pair of GO terms similar and
treat the respective pair of genes (the direct target and an off-target) as functionally related. Fig. 2
illustrates the text mining approach used for finding related biological processes. We set a threshold
of 5 for functionally correlated off-target genes, i.e. any candidate siRNA with more than five
functionally correlated off-targets is filtered out.
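A rough Python sketch of the preposition-based comparison is given below. Several details are assumptions, because the text does not spell them out: the preposition list, splitting at the first preposition only, and treating the part after the preposition (or the whole term when there is none) as the decisive part. The tool's actual rules depend on the type of preposition and may differ.

```python
PREPOSITIONS = {"of", "in", "to", "by", "via", "from", "during", "with"}  # assumed list

def split_on_preposition(term):
    """Split a GO Process term at its first preposition into (before, after)."""
    words = term.lower().split()
    for i, word in enumerate(words):
        if word in PREPOSITIONS and 0 < i < len(words) - 1:
            return " ".join(words[:i]), " ".join(words[i + 1:])
    return "", " ".join(words)  # no preposition: the whole term is the key part

def contains_phrase(a, b):
    """True when phrase a equals b or occurs inside b on word boundaries."""
    return f" {a} " in f" {b} "

def related_terms(term_a, term_b):
    """ASSUMPTION: two terms are related when their key parts match or nest."""
    _, key_a = split_on_preposition(term_a)
    _, key_b = split_on_preposition(term_b)
    return contains_phrase(key_a, key_b) or contains_phrase(key_b, key_a)

def functionally_related(target_terms, off_target_terms):
    return any(related_terms(a, b) for a in target_terms for b in off_target_terms)

def passes_functional_filter(target_terms, off_targets, max_related=5):
    """Keep a siRNA only if at most max_related off-targets are functionally related."""
    related = [gene for gene, terms in off_targets.items()
               if functionally_related(target_terms, terms)]
    return len(related) <= max_related, related
```

Under these assumptions, "regulation of cell proliferation" and "positive regulation of cell proliferation" are judged related, while "regulation of cell proliferation" and "regulation of transcription" are not.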
User feedback approach for off-target filtering
After the classified potential siRNA candidates are displayed, the user can select a
siRNA to view its predicted off-target genes. From this list, the user can select the off-targets
to be filtered out. The program then runs again to select potential siRNA candidates that are not
predicted to have the user-selected off-targets. The program can be rerun repeatedly until the
required off-target filtering conditions are met. In this way, off-target filtering by user feedback is
accomplished.
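One round of this feedback loop amounts to a set-based filter over the candidate list. The Python sketch below is a simplified illustration; the data layout (a mapping from siRNA to its predicted off-target genes) and all names are assumptions, not the tool's internal representation.

```python
def filter_by_user_choice(candidates, excluded_genes):
    """Keep candidates that hit none of the genes the user asked to exclude.

    candidates     -- dict: siRNA identifier -> set of predicted off-target genes
    excluded_genes -- genes selected by the user for filtering
    """
    excluded = set(excluded_genes)
    return {sirna: genes for sirna, genes in candidates.items()
            if not (set(genes) & excluded)}

# Two feedback rounds on hypothetical candidates: the exclusion list grows
# each time the user inspects the remaining off-target lists.
candidates = {
    "siRNA_1": {"TGFBR1", "SMAD7"},
    "siRNA_2": {"MMP2"},
    "siRNA_3": {"TGFBR2"},
}
round_1 = filter_by_user_choice(candidates, ["TGFBR1"])
round_2 = filter_by_user_choice(round_1, ["TGFBR1", "TGFBR2"])
```

After the second round only siRNA_2 remains, since it is the only candidate not predicted to hit either excluded gene.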
Implementation
We developed Customized siDesigner in the Java programming language (version 1.6) with a Swing user
interface, using the NetBeans IDE. Our siRNA design tool takes the target gene symbol as input and
returns the most potent siRNA candidates using a set of siRNA design rules. We used the current
version of the RefSeq RNA database (27) to collect human mRNA sequences. For implementation of the
SVM classifier, we used SVM-JAVA, the Java implementation of SVM by X. Jiang and H. Yu (31).
RESULTS
Testing with four different data sources
To evaluate our algorithm in terms of selection of effective siRNAs, we tested our system with 4
datasets from different sources. We measured the effectiveness of the classifier by 1) classification
accuracy, which describes how well it recognises the true class of the data (effective or non-effective); 2)
sensitivity, which describes how well it recognises the true positive data, i.e. efficient siRNAs; and 3)
specificity, which describes how well it recognises the true negative data, i.e. poorly efficient siRNAs. The
classification accuracy, sensitivity and specificity of the classifier are calculated from the following
formulae:
$$\mathrm{ACC} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (2)$$
$$\mathrm{SN} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{SP} = \frac{TN}{TN + FP} \qquad (4)$$
where ACC is the classification accuracy, SN is the sensitivity, SP is the specificity, TP is the number of true
positives, FN is the number of false negatives, TN is the number of true negatives and FP is the number of false
positives.
Testing with siRecords data. The trained classifier was tested with test sets from two different sources:
siRecords and the Huesken data. The test set from the siRecords database comprises a total of 100 samples,
50 positive and 50 negative (data in supplementary table 1). The classification
accuracy is 67% for the 100 test samples, while the sensitivity and specificity are 60% and 74%
respectively.
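The reported percentages can be reproduced from the definitions of accuracy, sensitivity and specificity. In the Python sketch below, the confusion-matrix counts are not taken from the paper: they are the counts implied by 60% sensitivity and 74% specificity on 50 positive and 50 negative samples.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    return acc, sn, sp

# Counts implied by the reported rates: 30 of 50 positives and 37 of 50
# negatives classified correctly.
acc, sn, sp = classifier_metrics(tp=30, fn=20, tn=37, fp=13)
```

These counts give an accuracy of 67/100, consistent with the 67% stated above.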
Testing with Huesken data. The Huesken dataset (38) is a set of experimentally validated
siRNAs of length 21 nucleotides, each annotated with its
normalized knockdown efficiency. The dataset comprises 2430 siRNAs in total. We
categorized siRNAs from this dataset as efficient or non-efficient based on cut-offs for their knockdown
efficiency (knockdown>0.5, knockdown>0.6, knockdown>0.7 or knockdown>0.8 for efficient siRNAs
and <0.4 for non-efficient siRNAs). The sensitivity increases significantly with increasing threshold
for positive samples, i.e. siRNAs with knockdown efficiency>0.7 and >0.8 are predicted more
accurately than siRNAs with knockdown efficiency>0.6 and >0.5 (72.508% for knockdown>0.5,
74.001% for knockdown>0.6, 75.752% for knockdown>0.7 and 79.886% for knockdown>0.8) (see fig.
3a). This result shows that our algorithm can identify highly efficient siRNAs more accurately than
siRNAs with moderate efficiency.
Specificity improved significantly for the high-scoring, top-ranked siRNAs (based on algorithm
score), but at the expense of decreased sensitivity. The specificity gradually increased for the top 30%,
top 20% and top 10% of siRNAs respectively (see Fig. 3b). We achieved a high specificity of 87.92%
when siRNAs with knockdown efficiency>0.6 were considered efficient.
We observed a significant improvement in specificity for the top 30 siRNAs ranked by algorithm
score. Of the 30 top scorers, 29 siRNAs (96.67%) have knockdown efficiency>0.6 and only one
siRNA has knockdown efficiency<0.4. Furthermore, 25 siRNAs (83.33%) have knockdown
efficiency>0.7 and 20 siRNAs (66.67%) have knockdown efficiency>0.75.
To compare the strength of our algorithm with the BioPredsi algorithm (which uses an artificial neural
network classifier for classification of effective siRNAs), we tested with the same test set as BioPredsi,
a set of 249 siRNAs. With a knockdown efficiency cut-off of 0.66 for efficient siRNAs, our
algorithm achieved a higher specificity (93.2%) than BioPredsi (83.3%) for the top 10% of the classified
siRNAs. In terms of sensitivity, however, BioPredsi performed better. But in siRNA selection, a
greater specificity is more desirable than a greater sensitivity, so that the chance of picking
non-functional siRNAs is minimized.
Testing with MIT siRNA database. We further tested Customized siDesigner on the MIT
database of 143 experimentally validated siRNAs (39), with known mRNA knockdown levels ranging
from 25-100%, for 104 different human genes. siRNAs designed by Customized siDesigner had
68.5% overlap with siRNAs from this database, and at least one overlapping siRNA for 68 genes (data in
supplementary table 2).
Testing with genome wide siRNA library. siRNAs designed by Customized siDesigner for human
genes on a genome-wide scale were checked against the genome-wide siRNA library from Qiagen that was
used in the Biogenesis screening experiment (40). The biogenesis project deals with ribosomes,
which are macromolecular complexes that synthesize proteins. The number of control wells on
each plate was equal: 16 control wells (positive/negative) in the first two columns and 16 control wells
(positive/negative) in the last two columns of each plate (9472 wells in total in the experiment). On each plate,
Allstars siRNA (Qiagen) was used as the negative control and Crm1 siRNA as the positive control. The
Qiagen library contains 76009 non-overlapping siRNAs for 19600 human target genes. The oligos
designed by Customized siDesigner have an overall 23% overlap with the Qiagen library, and for 6779 genes
the two libraries share at least one common oligo. The overlap increased (to 28.83%) when we
restricted the comparison to the most efficient siRNAs from the Qiagen library (data in supplementary table 3). We
also checked the overlap for oligos in the Qiagen library that target the 1000 genes playing a significant
role in biogenesis. Out of these 1000 genes, 652 genes share at least one common oligo between the Qiagen
library and Customized siDesigner, among which 609 genes share a common siRNA with a measured
signal in the biogenesis screen.
Reducing off-target effects
Functional off-target filtering. Functional off-target filtering prevents transcripts with related
GO Process terms from being off-targeted. We evaluated the performance of our text mining algorithm for
finding similar or related GO biological process terms by checking parent-child relationships in
the full gene ontology file (41) downloaded from the Gene Ontology website (42). The ontology file
contains 22773 GO biological process terms with their similar or related GO terms (related by an “is a” or
“part of” relationship). Our text mining algorithm identified 55.715% of the related terms. Table 2
shows some siRNAs designed for the gene MMP11 by five different online siRNA design tools, with
their predicted off-targets that have GO Process terms related to those of the direct target. We checked siRNAs
from the MIT siRNA database (siRNAs designed by Qiagen) for off-targets that are
involved in biological processes similar or related to those of the direct target. The data are shown in
supplementary table 4.
User feedback for customized specificity. A user interface was built using Java Swing for ease of use.
It lets the user provide the target gene symbol and presents the list of most potent
siRNAs. On selecting a particular siRNA from the list, the user can view its predicted off-target gene
list, from which one or more off-targets may be selected to be filtered out. Fig. 4 shows a screenshot of
the tool and the use of the feedback process.
DISCUSSION
The two main issues in siRNA selection are the potency and specificity of the siRNA. After two decades
of research into efficient siRNA selection parameters, the siRNA targeting success rate has
improved significantly compared to the early days of siRNA-mediated gene silencing experiments. Still,
a large number of the siRNAs in the libraries used in screening experiments are non-functional
and yield poor knockdown. Improving specificity, i.e. reducing the off-target effect arising from
silencing of inadvertent targets, is another challenge that has not yet been resolved
satisfactorily. Our newly developed siRNA design tool, Customized siDesigner, aims to improve
both the potency and the specificity of siRNA design by applying SVM-based classification of potent
siRNAs using a combined set of siRNA selection parameters, and by using functional off-target
filtering for improved specificity.
Several siRNA sequence selection algorithms developed in the past decade rely on
intrinsic sequence, stability and target accessibility features of functional siRNAs. Still, these
algorithms often fail to predict highly functional siRNAs. Good optimization of the siRNA selection
parameters and their weights is needed for successful prediction of potent siRNAs. Some methods
have been suggested for optimal weight distribution among siRNA selection parameters, such as rational
or weighted methods. Machine learning algorithms such as artificial neural networks have also been used
for classification of efficient siRNAs (38). Here we used support vector machine based
optimization with a non-linear Gaussian kernel, which is well suited to separating highly
functional siRNAs from non-functional ones. Our system was tested with a large number of
experimentally validated data samples from four different sources and gave good results, for example 67%
classification accuracy on the siRecords test set and 87.92% specificity for top-ranked siRNAs in the
Huesken data.
For improved specificity and reduced off-target effects of the designed siRNAs, we developed a new
algorithm to prevent functionally related transcripts from being off-targeted. To date, different siRNA
design tools have developed their own proprietary algorithms for siRNA design, but
minimization of the miRNA-like off-targeting effect is still rarely considered. The few tools that do consider
it rely on reducing the number of such off-targets, as
retaining some off-targets seems unavoidable, at least in computational design. Some design tools
rely on choosing siRNAs with low seed complement frequency, since siRNAs whose
seed region falls into the low-frequency group will have fewer off-targets (34). But in many
cases such seed sequences can decrease the potency of the siRNA, because they often
contain stretches of identical nucleotides or other features unfavourable for a potent siRNA. So, there
is a trade-off between the potency and specificity of a siRNA that has to be dealt with in
computational design.
Ideally the number of off-targets should be zero, but such a goal is known to be unachievable. What
can be achieved is the design of a siRNA that avoids unintended off-targets that may have
potentially harmful effects. For this to happen, the user has to interact with the design process to
customize the siRNA. Customized siDesigner is conceived to deliver such tailor-made siRNAs.
In siRNA screening experiments, it has often been reported that the desired output is affected
by silencing of unintended off-targets that are sometimes themselves
upstream pathway components of the direct target gene (35). In such cases it can be useful to
specifically avoid some off-targets that can cause more harmful or undesirable effects. During
the silencing process, a siRNA should not silence any other mRNA from the
same pathway as the target mRNA. When investigating a gene's function, if any gene from
the same pathway other than the target gene is silenced, it becomes difficult to determine the actual
phenotype of silencing the gene under investigation. For example, in a siRNA screening experiment
conducted to identify novel members of the transforming growth factor (TGF)-β pathway in a
human keratinocyte cell line, an unwanted off-target effect was observed because of unintended
silencing of two known upstream pathway components, the TGF-β receptors 1 and 2 (TGFBR1 and
TGFBR2) (33). Such off-target silencing poses the threat of confusing and misleading results.
The siRNAs suggested by online siRNA selection tools are also often predicted to have off-targets
that belong to the same pathway as, or are otherwise related to, the direct target. Customized siDesigner is a
siRNA selection tool whose functional off-target filtering reduces the number of functionally
related off-targets by comparing their associated GO Process terms using a text mining algorithm. Our
simple yet effective text mining algorithm identifies not only identical but also related
GO biological process terms. In future we aim to apply a more refined text mining algorithm for
identifying hierarchically related GO process terms. Our tool also gives the user a choice to filter out
the undesired off-targets. User can select a siRNA from a list of most potent candidates to see the
predicted off-target gene list from where he may select one or more off-targets to be filtered out. This
tool is the first in its kind that presents the user with such choices for customized specificity. In coming
days, this kind of functional off-target filtering of siRNAs may prove to be beneficial for use in cultured
cells to study gene functions.
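The functional filtering step can be illustrated with a short Python sketch. It is not the text mining algorithm of Customized siDesigner, whose details are not given in this section. It assumes, purely for illustration, that two GO Process terms count as related when the content words of one are contained in the other, so that "proteolysis" and "regulation of proteolysis" match. The function names, the stop-word list and the example annotations are hypothetical; the cutoff of fewer than 5 related off-target genes is the one listed in Table 1.

```python
# Illustrative sketch only: a simple word-containment rule stands in for the
# text mining algorithm, whose details are not given in this section.
STOP_WORDS = {"of", "the", "in", "to", "and", "by", "via"}  # assumed list


def content_words(term):
    """Lower-case a GO Process term and drop the assumed stop words."""
    return {w for w in term.lower().split() if w not in STOP_WORDS}


def terms_related(term_a, term_b):
    """Assumed rule: identical terms, or one term's words contained in the other."""
    a, b = content_words(term_a), content_words(term_b)
    return bool(a) and bool(b) and (a <= b or b <= a)


def related_off_targets(target_terms, off_target_terms):
    """Return off-target genes sharing a related GO Process term with the target.

    off_target_terms maps a gene symbol to a list of GO Process term names.
    """
    return sorted(
        gene
        for gene, terms in off_target_terms.items()
        if any(terms_related(t, o) for t in target_terms for o in terms)
    )


def passes_functional_filter(target_terms, off_target_terms, max_related=5):
    """Table 1 cutoff: fewer than 5 functionally related off-target genes."""
    return len(related_off_targets(target_terms, off_target_terms)) < max_related


# Hypothetical annotations, loosely modelled on the MMP11 rows of Table 2.
target = ["proteolysis", "multicellular organismal development"]
off_targets = {
    "LTBP4": ["regulation of proteolysis"],
    "DISC1": ["multicellular organismal development"],
    "GENE_X": ["lipid transport"],  # made-up unrelated gene
}
print(related_off_targets(target, off_targets))
```

A word-containment rule is deliberately crude: it would also treat "negative regulation of proteolysis" as related to "proteolysis", and it knows nothing about the GO hierarchy, which is the refinement mentioned above as future work.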
AVAILABILITY
Customized siDesigner is freely available for download from http://gyanxet-beta.com.
SUPPLEMENTARY DATA
Supplementary Tables 1-4 list, respectively: the training and testing siRNA datasets from siRecords;
the siRNAs common to the MIT siRNA database and those designed by Customized siDesigner for 68
genes; the siRNAs common to those designed by Customized siDesigner and those in the Qiagen
genome-wide siRNA library with a hit score greater than 0.3 in the biogenesis screening project; and
the siRNAs from the MIT siRNA database predicted to have functionally related off-targets.
ACKNOWLEDGEMENT
We express our gratitude to Prof. Ulrike Kutay from the Department of Biochemistry, ETH Zurich, for
providing siRNA screening data from the Biogenesis project. We also thank Sanga Mitra and Smarajit
Das from the Indian Association for the Cultivation of Science for their helpful suggestions on this
paper.
FUNDING
This work was supported by the Department of Science and Technology, Government of India.
REFERENCES
1. Grosshans H. and Filipowicz W., (2008) The expanding world of small RNAs. Nature, 451,
414–416.
2. Fougerolles A. D., Vornlocher H.P, Maraganore J. and Lieberman J., (2007) Interfering with
disease: a progress report on siRNA-based therapeutics. Nat. Rev. Drug Discovery, 6, 443–
453.
3. Elbashir SM, Harborth J, Lendeckel W, Yalcin A, Weber K, Tuschl T, (2001) Duplexes of
21-nucleotide RNAs mediate RNA interference in cultured mammalian cells. Nature, 411, 494–498.
4. Pei Y. and Tuschl T., (2006) On the art of identifying effective and specific siRNAs. Nat.
Methods, 3, 670–676.
5. Reynolds, A., D. Leake, Q. Boese, S. Scaringe, W. S. Marshall, and A. Khvorova, (2004)
Rational siRNA design for RNA interference. Nat. Biotechnology, 22, 326-330.
6. [http://www.rnaiweb.com/RNAi/siRNA_Design/index.html]
7. Wang L. and Mu F. Y., (2004) A Web-based design center for vector-based siRNA and
siRNA cassette. Bioinformatics, 20, 1818–1820.
8. Holen T., (2006) Efficient prediction of siRNAs with siRNArules 1.0: An open-source JAVA
approach to siRNA algorithms. RNA, 12, 1620–1625.
9. Yuan B., Latek R., Hossbach M., Tuschl T. and Lewitter F., (2004) siRNA Selection Server: an
automated siRNA oligonucleotide prediction server. Nucleic Acids Res., 32, 130–134.
10. Chalk A. M, Wahlestedt C., and Sonnhammer E. L.L., (2004) Improved and automated
prediction of effective siRNA. Biochem. Biophys. Res. Commun., 319, 264–274.
11. Park Y. K., Park S. M., Choi Y. C., Lee D., Won M. and Kim Y. J., (2008) AsiDesigner:
exon-based siRNA design server considering alternative splicing. Nucleic Acids Res., 36, 97–
103.
12. AsiDesigner [http://sysbio.kribb.re.kr:8080/AsiDesigner/menuDesigner.jsf].
13. Ambion siRNA Target Finder [http://www.ambion.com/techlib/misc/siRNA_finder.html].
14. MicroSynth siRNA design [http://www.microsynth.ch/499.0.html].
15. IDT RNAi Design [http://www.idtdna.com/Scitools/Applications/RNAi/RNAi.aspx].
16. Imgenex siRNA tool [http://www.imgenex.com/sirna_tool.php].
17. BLOCK-iT RNAi Designer [https://rnaidesigner.invitrogen.com/rnaiexpress].
18. siSearch [http://sonnhammer.cgb.ki.se/siSearch/siSearch_1.7.html].
19. Promega siRNA Target Designer [http://www.promega.com/siRNADesigner/program/].
20. siDESIGN Center [http://www.dharmacon.com/sidesign].
21. siRNA Target Finder [https://www.genscript.com/ssl-bin/app/rnai].
22. Whitehead WI siRNA Selection Program [http://jura.wi.mit.edu/bioc/siRNAext/].
23. BIOPREDsi [http://www.biopredsi.org/].
24. siDirect [http://sidirect2.rnai.jp].
25. siDRM [http://sidrm.biolead.org/]
26. Jackson A.L, Linsley P.S, (2010) Recognizing and avoiding siRNA off-target effects for target
identification and therapeutic application. Nat. Rev. Drug Discov., 9, 57-67.
27. Human Refseq RNA database.
[ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.gbff.gz]
28. Judge D., Sood V., Shaw J.R., Fang D., McClintock K. and MacLachlan I., (2005)
Sequence-dependent stimulation of the mammalian innate immune response by synthetic siRNA. Nat.
Biotechnol, 23, 457-462.
29. Fedorov Y, Anderson E.M, Birmingham A, Reynolds A, Karpilow J, Robinson K, Leake D,
Marshall W.S, Khvorova A., (2006) Off-target effects by siRNA can induce toxic phenotype.
RNA, 12, 1188-1196.
30. Platt J. C., (1999) Fast training of support vector machines using sequential minimal
optimization, Advances in kernel methods. MIT Press Cambridge, pp 185 – 208.
31. Jiang X. and Yu H., (2008) SVM-JAVA: A Java implementation of the SMO (Sequential
Minimal Optimization) for training SVM. [http://dm.postech.ac.kr/svm-java].
32. NCBI gene2go mapping database. [ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz].
33. Birmingham A, Anderson E.M, Reynolds A, Ilsley-Tyree D, Leake D, Fedorov Y, Baskerville S,
Maksimova E, Robinson K, Karpilow J, Marshall W.S, Khvorova A, (2006) 3' UTR seed
matches, but not overall identity, are associated with RNAi off-targets. Nat. Methods, 3, 199-204.
34. Anderson E. M., A. Birmingham, S. Baskerville, A. Reynolds, E.Maksimova, D. Leake, Y.
Fedorov, J. Karpilow and A. Khvorova, (2008) Experimental validation of the importance of
seed complement frequency to siRNA specificity. RNA, 14, 853-861.
35. Schultz N, Marenstein D.R, De Angelis D.A, Wang W.Q, Nelander S, Jacobsen A, Marks D.S,
Massagué J, Sander C., (2011) Off-target effects dominate a large-scale RNAi screen for
modulators of the TGF-β pathway and reveal microRNA regulation of TGFBR2. Silence, 2.
36. Y. Ren, W. Gong, Q. Xu, X. Zheng, D. Lin, Y. Wang and T. Li, (2006) siRecords: an extensive
database of mammalian siRNAs with efficacy ratings. Bioinformatics, 22, 1027-1028.
37. Grillo G, Turi A, Licciulli F, Mignone F, Liuni S, Banfi S, Gennarino VA, Horner DS, Pavesi G,
Picardi E, Pesole G., (2010) UTRdb and UTRsite (RELEASE 2010): a collection of sequences
and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res.,
38, 75-80.
38. Huesken D, Lange J, Mickanin C, Weiler J, Asselbergs F, Warner J, Meloon B, Engel S,
Rosenberg A, Cohen D, Labow M, Reinhardt M, Natt F, Hall J., (2005) Design of a genome-wide
siRNA library using an artificial neural network. Nat. Biotechnol, 23, 995-1001.
39. MIT siRNA database [http://web.mit.edu/sirna/sirnas-human.html].
40. Wild T, Horvath P, Wyler E, Widmann B, Badertscher L, Zemp I, Kozak K, Csucs G, Lund E,
Kutay U, (2010) A protein inventory of human ribosome biogenesis reveals an essential
function of Exportin 5 in 60S subunit export. PLoS Biol, 8:e1000522.
41. Full gene ontology file, release date 06:03:2012, version 1.2, cvs version 1.2681.
42. Gene Ontology Consortium, (2001) Creating the gene ontology resource: design and
implementation. Genome Res., 11, 1425-33.
TABLE AND FIGURES LEGENDS
Table 1. Detailed parameter set of Customized siDesigner.
Stringent parameters (these parameters cannot be adjusted by the user)
Parameter | Rule
Target starting site | Within ORF region: 50 nt downstream of the start codon AUG. Within 3’ UTR: 50 nt downstream of the end of CDS
Alternative splicing | The 21 nt target region is common for all alternatively spliced isoforms of the target gene
Adjustable parameters (the user can select or deselect these parameters to apply or not apply these rules stringently)
Parameter | Rule
Target starts with | AA dinucleotide
GC content | 30% to 60%
Thermodynamic difference of the sense and antisense strand 5’ end | Sense strand 5’ end (bases 1-6 of sense strand) binding energy minus antisense strand 5’ end (bases 14-19) binding energy < 0 (binding energy is calculated by base stacking energy using nearest neighbour algorithm)
Presence of immune stimulatory motif | 1) No presence of “GUCCUUCAA” in the antisense strand (which is RISC entering); 2) GU content < 65%
SNP sites within the target region | No SNP sites within the target region
Homology with other mRNAs | No more than 16 contiguous base homology with non-target mRNAs
miRNA like off-targets | Number of miRNA like off-targets less than 100
Functional off-target filtering | Number of off-target genes having a related GO Process term with the target gene < 5
Parameters for training the SVM classifier
GC content of the 19 nt core region of the target site (in %)
GU content of the 19 nt core region of the target site (in %)
Presence of a stretch of 4 identical nucleotides (1 or 0)
Presence of immune stimulatory motif GUCCUUCAA in the antisense strand (1 or 0)
Thermodynamic difference of the sense and antisense strand 5’ end
Number of A/U in the 14th-19th position in the sense strand (in %)
Number of G/C in the 1st-6th position in the sense strand (in %)
Number of position-specific nucleotide features present: presence of A at position 1 + presence of A or U at position 18 + presence of C at position 8 + presence of any base other than G at position 16
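Several of the rules in Table 1 depend only on the sequence of the target site and can be checked directly. The Python sketch below is a hedged illustration, not the tool's implementation. It assumes that the 19 nt core is the 21 nt target site without its first two bases, that GC and GU content are measured on that core, and that the antisense strand is the reverse complement of the core. The function and key names are hypothetical. Rules that need external data (binding energies, SNPs, homology and off-target counts) are left out.

```python
import re

IMMUNE_MOTIF = "GUCCUUCAA"  # motif named in Table 1


def to_rna(seq):
    """Upper-case a sequence and write it in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")


def antisense(sense):
    """Reverse complement of an RNA sense strand."""
    return to_rna(sense).translate(str.maketrans("ACGU", "UGCA"))[::-1]


def percent(seq, bases):
    """Percentage of positions in seq occupied by any of the given bases."""
    return 100.0 * sum(seq.count(b) for b in bases) / len(seq)


def check_rules(target_21nt):
    """Evaluate the sequence-only rules of Table 1 on a 21 nt target site.

    Assumption: the 19 nt core is the target site without its first two bases.
    """
    site = to_rna(target_21nt)
    if len(site) != 21:
        raise ValueError("expected a 21 nt target site")
    core = site[2:]
    return {
        "starts_with_AA": site.startswith("AA"),
        "gc_30_to_60": 30.0 <= percent(core, "GC") <= 60.0,
        "gu_below_65": percent(core, "GU") < 65.0,
        "no_immune_motif": IMMUNE_MOTIF not in antisense(core),
        # SVM feature in Table 1; reported here as a flag, not a hard filter
        "has_run_of_4": re.search(r"(.)\1{3}", core) is not None,
    }


# One of the 21 nt sense strands listed in Table 2.
print(check_rules("AAGGTATGGAGCGATGTGACG"))
```

Returning a dictionary of named flags, rather than a single pass or fail value, mirrors the fact that the user can switch each adjustable rule on or off.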
Table 2. siRNAs designed for the gene MMP11 by five different online siRNA designing tools, with their
predicted off-targets that have a related GO Process with the direct target
siRNA selection tool | siRNA sense strand | Predicted off-target | Related GO Process
siDesign centre (Dharmacon) | GCACACAACAGCAGCCAAG | CASP14 | proteolysis
siDesign centre (Dharmacon) | GCACACAACAGCAGCCAAG | SHROOM4 | multicellular organismal development
BiopredSi (Qiagen) | CCTCCAAAGCCATTGTAAA | VANGL2 | multicellular organismal development
BiopredSi (Qiagen) | CCTCCAAAGCCATTGTAAA | SATB2 | multicellular organismal development
BiopredSi (Qiagen) | CCTCCAAAGCCATTGTAAA | IKZF1 | multicellular organismal development
BiopredSi (Qiagen) | CCTCCAAAGCCATTGTAAA | SPRED1 | multicellular organismal development
WI siRNA selection server (Whitehead institute) | AATGTGTGTACAGTGTGTATA | PAFAH1B1 | multicellular organismal development
Genscript siRNA target finder (Genscript) | AAGGTATGGAGCGATGTGACG | PCSK2 | proteolysis
Genscript siRNA target finder (Genscript) | AAGGTATGGAGCGATGTGACG | JOSD1 | proteolysis
siDirect | GTTGTTTTCTAAAAAAAAAAA | LTBP4 | Regulation of proteolysis
siDirect | GTTGTTTTCTAAAAAAAAAAA | DISC1 | multicellular organismal development
siDirect | GTTGTTTTCTAAAAAAAAAAA | EBF2 | multicellular organismal development
siDirect | GTTGTTTTCTAAAAAAAAAAA | GSK3B | multicellular organismal development
Figure 1. Workflow of the siRNA designing algorithm.
Figure 2. Illustration of the text mining algorithm used for finding related GO biological process terms
Fig 3. (a) Sensitivity and specificity of the classifier for different thresholds of knockdown
efficiency in the Huesken dataset. As the knockdown efficiency threshold is raised (0.4, 0.5, 0.6,
0.7, 0.8), the sensitivity increases while the specificity decreases.
(b) Increase in specificity for the top-scoring siRNAs rated by the algorithm (threshold for
knockdown efficiency kept at 0.6). Specificity increases greatly for the top 30%, 20% and 10% of siRNAs.
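For readers who want to reproduce this kind of evaluation, the following Python sketch shows how sensitivity and specificity can be computed at a chosen knockdown efficiency threshold. It assumes that a siRNA counts as truly potent when its measured knockdown efficiency is at or above the threshold, and that the classifier output is a simple potent or not-potent call. The function name and the six data points are made up for illustration and are not taken from the Huesken dataset.

```python
def sensitivity_specificity(measured, predicted_potent, threshold=0.6):
    """Return (sensitivity, specificity) at a knockdown efficiency threshold.

    measured: measured knockdown efficiencies (0 to 1), one per siRNA.
    predicted_potent: classifier calls (True = predicted potent), same order.
    Assumption: a siRNA is truly potent when measured >= threshold.
    """
    tp = fn = tn = fp = 0
    for efficiency, predicted in zip(measured, predicted_potent):
        actual = efficiency >= threshold
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up example data, for illustration only.
measured = [0.9, 0.7, 0.65, 0.5, 0.3, 0.2]
predicted = [True, True, False, True, False, False]
for cutoff in (0.6, 0.8):
    print(cutoff, sensitivity_specificity(measured, predicted, cutoff))
```

With the classifier calls held fixed, raising the threshold shrinks the set of truly potent siRNAs, which is why sensitivity and specificity move in opposite directions.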
Fig 4. A screenshot of Customized siDesigner user interface