Integrating R into Discovery Net System

Qiang Lu, Xinzhong Li
{qianglu, xinzhong}@doc.ic.ac.uk
Dept. of Computing, Imperial College, 180 Queens Gate, London, SW7 2RH, UK
Moustafa Ghanem, Yike Guo
{mmg, yg}@inforsense.com
459a Fulham Road, Chelsea, London SW10 9UZ, UK
Haiyan Pan
{hypan}@scbit.org
Shanghai Center for Bioinformation Technology, Shanghai, 210235, China.
Abstract
The Discovery Net System is a workflow-based distributed computing environment that
allows various tools to be integrated, and thus provides a high-performance platform for
data mining and knowledge discovery. Bioinformatics is expanding rapidly, and hundreds
of algorithms and software packages are developed every year. Its capacity for
integration makes the Discovery Net System an ideal uniform platform for bioinformatics
research, where comprehensive and systematic analysis is always needed. R, an
open-source statistical tool, is becoming very popular in bioinformatics research,
especially in microarray data analysis. Integrating R into the Discovery Net System is
therefore of great significance. In this paper, we describe a framework with which R
functions can be integrated into the Discovery Net System without any further
programming. We present the methodology and demonstrate an application in microarray
analysis.
1. Introduction
Bioinformatics is a rapidly growing research field in
which thousands of papers are published every year,
describing new problems and new strategies. It is
very hard for any single software package to handle
all of these problems, especially with up-to-date
methodologies. Research centers and large
pharmaceutical companies therefore commonly run
several academic or commercial software systems,
even with overlapping functions, to support their
daily data management and analysis. Communication
between these different systems becomes a challenge.
Discovery Net, one of EPSRC's six pilot projects
(http://www.lesc.ic.ac.uk/), provides such an
integrative analysis platform: the Discovery Net
System, a workflow-based distributed computing
environment that allows various tools to be
integrated. The Discovery Net middleware is written
in Java using J2EE. In the system, every distributed
(local or remote) algorithm or function is
incorporated as a pluggable component (a node). A
Web Start enabled client provides a portal through
which users access and manage the computation
services. The portal provides functions for
submitting tasks, retrieving results and managing
workflows, as well as for interactive visualization.
Rowe A. et al. (2003), Syed J. et al. (2004) and
Curcin V. et al. (2004) have described the Discovery
Net System in detail and demonstrated several
applications in genome research.
Recently, the open-source program R
(http://www.r-project.org/) has become very
popular in bioinformatics research. It provides a
wide variety of statistical algorithms (linear and
nonlinear modeling, classical statistical tests,
time-series analysis, classification, clustering,
microarray analysis, and so on) and graphical
facilities. Moreover, since R is free and highly
extensible, many academic life science projects have
contributed to it, extending it quickly to many
related areas. R now covers not only its original
subject of statistics but also a number of
bioinformatics problems, for which it provides
exceptionally efficient solutions. Given R’s
powerful functionality, some commercial data
analysis packages such as GeneSpring, Spotfire and
Rosetta Resolver, and some open-source programs
such as Gaggle, EBI Expression Profiler and RACE
(Psarros M. et al. 2005), have already integrated R
into their latest versions.
Here we introduce the integration of R into the
Discovery Net System and its use in microarray
expression data analysis. The most significant
feature of this integration is a generic framework
with which users can integrate R functions or
subroutines themselves without further programming.
An integration can be completed in a couple of
minutes.
2. Method
R is a language and environment with two
significant functionalities: 1) statistical computing
and 2) interactive graphics. With R in the Discovery
Net System, both the powerful statistical algorithms
and the interactive graphical facilities are
naturally expected.
2.1 Integration of Algorithms
In general, algorithms in R are written in the S
language. The R software provides an engine to
execute these scripts and a GUI to interact with
users. To integrate its algorithms, that is, to
invoke R functions from the Discovery Net System,
the R API is used; it is written in C and provided
as a shared library. Since the Discovery Net System
framework is built in Java, the Java Native
Interface (JNI) is introduced to fill the gap in the
invocation stack, as shown in Figure 1.
Beyond the invocation stack, the more significant
point is the generic framework, General-R, which
facilitates the integration. The core of this
framework is an XML file, i3xml (an XML
implementation of Interface 3 of the WfMC Workflow
Reference Model), which follows the format of the
Web Services Description Language (WSDL) and
describes R functions so that they fit the node
specification of the Discovery Net System.
Figure 1 Architecture bridging Discovery Net and R,
where r/w means read and write respectively.
Using WSDL, an R function is described with its
input and output parameters as messages and its
name as an operation.
With the support of XML Schema Definition
(XSD), the types of input and output messages are
flexible. Basic types such as Double, Integer,
String and Boolean can be used, as can complex
types such as array and matrix. Moreover, to
transfer raw datasets, a special data type, REXP
(R EXPression), is introduced.
Each operation describes an R function in three
parts: init, action and final. The R engine invokes
them in succession. In the example in Table 1, the
scripts init.R and final.R in the folder
preprocessAffy/expresso are invoked before and
after the action expresso(), respectively.
<wsdl:operation name="Expresso">
  <kde:init script="preprocessAffy/expresso/init"/>
  <kde:final script="preprocessAffy/expresso/final"/>
  <kde:operation action="eset_tmp &lt;- expresso(abatch, bgcorrect.method=bgcorrect, normalize.method=normalize, pmcorrect.method=pmcorrect, summary.method=summary)"/>
</wsdl:operation>
Table 1 Example of an operation, in which the Bioconductor
function expresso is integrated.
Once the i3xml file describes the R functions, the
General-R framework reads and interprets the
description and invokes the R engine to execute the
corresponding R scripts.
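The read-and-interpret step can be sketched as follows. This is a minimal illustration in Python rather than the Java of General-R itself. It assumes a hypothetical namespace URI for the kde prefix (the paper does not give one), a shortened action string whose `<-` assignment is an assumption, and that script names map to files with an .R extension. It only builds the ordered list of steps (init, action, final) and does not call R.

```python
import xml.etree.ElementTree as ET

# The kde namespace URI is a placeholder; the real one is not given in the paper.
KDE_NS = "urn:example:kde"

FRAGMENT = """
<wsdl:operation xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                xmlns:kde="urn:example:kde" name="Expresso">
  <kde:init script="preprocessAffy/expresso/init"/>
  <kde:final script="preprocessAffy/expresso/final"/>
  <kde:operation action="eset_tmp &lt;- expresso(abatch)"/>
</wsdl:operation>
"""


def execution_plan(xml_text):
    """Return (operation name, steps) with steps ordered init, action, final."""
    op = ET.fromstring(xml_text.strip())
    ns = {"kde": KDE_NS}
    steps = []

    init = op.find("kde:init", ns)
    if init is not None:
        # Assumed convention: the script attribute names a file without ".R".
        steps.append(("source", init.get("script") + ".R"))

    action = op.find("kde:operation", ns)
    if action is not None:
        steps.append(("eval", action.get("action")))

    final = op.find("kde:final", ns)
    if final is not None:
        steps.append(("source", final.get("script") + ".R"))

    return op.get("name"), steps


if __name__ == "__main__":
    name, steps = execution_plan(FRAGMENT)
    print(name)
    for kind, payload in steps:
        print(kind, payload)
```

The order of the steps is fixed by the function rather than by the order of the elements in the file, which matches the description that the three parts are invoked in succession.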
In the Discovery Net System, components such as
those for algorithms are specified by their input
data/metadata, output data/metadata and parameters.
A task run in the Discovery Net System is organized
as a workflow: the input of a component comes from
the previous component, and its output goes to the
next one. The parameters are entered before
execution. To describe an R function as a Discovery
Net System component, its inputs, outputs and
parameters need to be defined. In general, these are
mapped to the input and output messages defined in
the i3xml file.
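As a rough illustration of this component model, the sketch below chains nodes so that each node's output becomes the next node's input, with parameters fixed before execution. The Node class, run_workflow function and the two example nodes are hypothetical names introduced for this sketch; they are not part of the Discovery Net API.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    """A hypothetical workflow component: a function plus preset parameters."""
    name: str
    run: Callable[[Any, Dict[str, Any]], Any]
    params: Dict[str, Any] = field(default_factory=dict)


def run_workflow(nodes: List[Node], data: Any) -> Any:
    """Feed the output of each node into the next one, in order."""
    for node in nodes:
        data = node.run(data, node.params)
    return data


# Two illustrative nodes: a log2 transform and a threshold filter.
log2_node = Node("Log2", lambda values, p: [math.log2(v) for v in values])
filter_node = Node(
    "Filter",
    lambda values, p: [v for v in values if v >= p["min"]],
    {"min": 3.0},  # parameter typed in before execution
)

if __name__ == "__main__":
    print(run_workflow([log2_node, filter_node], [4.0, 8.0, 16.0]))
```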
Editing the XML file in a text editor is tedious.
For this reason, an editor for i3xml files, XML
Wrapper, was developed.
2.2 Integration of Interactive Graphics
With the integration of algorithms described above,
any graphical result created by an R function can be
obtained by defining the result with a graphical
type such as JPGPicture, PostscriptPicture or
PNGPicture. However, this method means that
graphical properties such as axes, colors and
legends must be defined in advance. This is
inconvenient, especially for exploring a dataset
interactively.
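One way to picture why the graphical properties must be fixed in advance is the sketch below, which wraps a plotting expression in an R graphics device chosen from the declared result type. The mapping from result types to the R devices jpeg(), postscript() and png() is an assumption made for illustration; the paper does not say how General-R produces the files. The function only generates R script text and does not run R.

```python
# Assumed mapping from the result types named in the text to R devices.
DEVICES = {
    "JPGPicture": ("jpeg", ".jpg"),
    "PostscriptPicture": ("postscript", ".ps"),
    "PNGPicture": ("png", ".png"),
}


def wrap_plot(result_type, plot_expr, basename):
    """Return (file name, R script) that renders plot_expr to a file."""
    device, extension = DEVICES[result_type]
    filename = basename + extension
    lines = [
        f'{device}("{filename}")',  # open the file-based graphics device
        plot_expr,                  # the plot, with all options already fixed
        "dev.off()",                # close the device so the file is written
    ]
    return filename, "\n".join(lines)


if __name__ == "__main__":
    filename, script = wrap_plot("PNGPicture", "boxplot(x)", "qc")
    print(filename)
    print(script)
```

Because the whole plotting expression is part of the script before it runs, changing an axis or a legend means editing the node and running it again.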
Figure 2 Integration of Interactive Graphics
To overcome this disadvantage, R is also integrated
into the Discovery Net System client. To facilitate
the integration, the third-party software JGR is
adopted. JGR (Java GUI for R) includes an
interactive interface for editing and entering R
scripts. It uses a Java graphics device, JavaGD, in
which all painting functions of R are delegated to
a Java class. After adding some functions for
communication between the Discovery Net System and
R, such as loading data from and saving figures to
the Discovery Net System, the interactive graphics
function is implemented as the Expert-R node, as
Figure 2 shows.
3. Application
The Wellcome Trust Functional Genomics
Development Initiative funds a programme, the
Biological Atlas of Insulin Resistance (BAIR,
http://www.bair.org.uk), which aims to elucidate the
mechanism of insulin resistance. The Discovery Net
System was chosen as the main platform for its
daily data analysis.
Here we give an example of analyzing IRS2
knockout mouse expression data with the R
integration, as part of the BAIR research.
Figure 3 shows the analysis workflow.
This application uses 7 IRS2 knockout and 6
wildtype Affymetrix MOE430v2 chips. We chose the
standard RMA normalization approach, and also
calculated MAS5 present/absent calls. The data were
log2 transformed. After Welch’s t-test, multiple
testing correction was applied. The predefined
Bonferroni and Benjamini & Hochberg procedures and
Storey’s FDR in the R multi-test package were
performed.
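For readers unfamiliar with Welch's t-test, the sketch below computes its statistic and the Welch-Satterthwaite degrees of freedom for one probeset after a log2 transform. The intensity values are made up for illustration (7 knockout and 6 wildtype chips, matching the design above), and the function name welch_t is introduced here. The analysis in the paper was performed with R nodes, not with this code, and turning the statistic into a p-value additionally requires the t distribution, which is omitted.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of the mean of a
    vb = variance(b) / len(b)  # squared standard error of the mean of b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up raw intensities for a single probeset.
knockout = [210.0, 198.0, 225.0, 190.0, 205.0, 215.0, 201.0]
wildtype = [300.0, 280.0, 310.0, 295.0, 305.0, 288.0]

if __name__ == "__main__":
    log_ko = [math.log2(x) for x in knockout]
    log_wt = [math.log2(x) for x in wildtype]
    t, df = welch_t(log_ko, log_wt)
    print(f"t = {t:.3f}, df = {df:.2f}")
```

Unlike Student's t-test, the two groups are not assumed to share a variance, which is why the degrees of freedom are estimated from the data.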
After applying the Benjamini & Hochberg FDR
procedure, we found that no probeset passes
FDR < 0.05, and only 9 probesets pass FDR < 0.25.
Of these 9 probesets, 3 are absent across all 13
chips and one does not change enough
(FoldChange = 1.03). The other 5 genes remain
significantly differentially expressed between IRS2
knockout and wildtype mice: Gpd2, Atp1b1, Pak1ip1
and Cdkn1b are about 30% down-regulated, and Dnpep
is about 60% up-regulated. Among these five genes,
Cdkn1b, described as the protein p27(Kip1),
regulates cell cycle progression in mammals by
inhibiting the activity of cyclin-dependent kinases
(CDKs). Other researchers (Uchida T. et al. 2005)
have confirmed that deletion of Cdkn1b ameliorates
hyperglycemia by maintaining compensatory
hyperinsulinemia in diabetic mice; thus p27(Kip1)
contributes to beta-cell failure during the
development of type 2 diabetes in Irs2 knockout
mice and represents a potential new target for the
treatment of this condition.
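The Bonferroni and Benjamini & Hochberg adjustments used for these thresholds can be sketched in a few lines. This is a plain Python illustration of the standard procedures with made-up p-values; it is not the R code run in the workflow, and the function names are introduced here.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values
    adjusted = benjamini_hochberg(pvals)
    passing = [i for i, q in enumerate(adjusted) if q < 0.25]
    print(bonferroni(pvals))
    print(adjusted)
    print(passing)
```

A probeset passes a given FDR level when its adjusted value falls below that level, so the same adjusted list serves both the 0.05 and the 0.25 thresholds.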
Figure 3 Workflows with R integration for BAIR, where PreprocessAffy, ttest and FDR are nodes built with the General-R
framework, while Expert-R is the integrated interactive R environment. ViewExpert-R shows the figures created.
Other nodes, such as Boxplot, Filter, Derive and Join, are other Knowledge Discovery Network nodes.
Because almost no single gene passed the multiple
comparison correction, it is necessary to identify
groups of genes with similar regulation between
IRS2 knockout and wildtype. We first obtained 586
probesets (representing 524 unique genes) with a
Welch’s t-test P < 0.01. Onto-Express
(http://vortex.cs.wayne.edu/ontoexpress/) with
multiple testing correction then indicated that the
ontology groups RNA binding (P=0.0), endocytosis
(P=5.0E-5), protein modification (P=3.3E-4), nucleus
(P=5.4E-4), perinuclear region (P=6.8E-4),
vesicle-mediated transport (P=0.002) and lipid
biosynthesis (P=0.003) were represented to a
significantly greater extent in wildtype vs. IRS2
knockout than expected if ontology groups were
randomly distributed within the list of these 524
top genes.
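The idea of testing whether an ontology group is over-represented in a gene list can be illustrated with a hypergeometric upper-tail probability. The paper does not state which statistical model Onto-Express applied, so the hypergeometric model here is an assumption, and apart from the list size of 524 genes every count in the example is made up.

```python
from math import comb


def hypergeom_upper_tail(population, annotated, drawn, observed):
    """P(X >= observed) for X annotated genes among `drawn` genes.

    population: genes on the array (background)
    annotated:  background genes that belong to the ontology group
    drawn:      genes in the selected list
    observed:   selected genes that belong to the ontology group
    """
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical counts: 20000 background genes, 400 in the group,
    # 524 genes selected, 25 of them in the group.
    print(hypergeom_upper_tail(20000, 400, 524, 25))
```

A small probability means the group appears in the list more often than random selection from the background would explain.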
4. Summary
As shown above, with the General-R framework we
have developed, users can easily integrate R
functions or subroutines into the Discovery Net
System. So far, many R functions in the Bioconductor
packages have been integrated, which empowers the
Discovery Net System for microarray analysis.
Some improvements are nevertheless proposed.
First, instead of using an XML editor for
integration, an annotation method is proposed.
Annotations added to the R code as comments define
the input and output messages and the operations.
The General-R framework will parse the R script to
obtain the integration information, including
message names, types and operations, and thus
integrate the functions into the Discovery Net
System automatically.
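A minimal sketch of what such annotation parsing could look like is given below. The `#@` comment syntax and the keywords operation, input, output and param are purely hypothetical, since the paper proposes the approach without fixing a syntax, and the R snippet is shortened for illustration.

```python
import re

# Hypothetical annotation syntax: "#@<kind> <name> [<type>]" in an R comment.
ANNOTATION = re.compile(r"^#@(operation|input|output|param)\s+(\S+)(?:\s+(\S+))?\s*$")

R_SOURCE = """\
#@operation Expresso
#@input abatch REXP
#@param normalize String
#@output eset_tmp REXP
eset_tmp <- expresso(abatch, normalize.method = normalize)
"""


def parse_annotations(r_source):
    """Collect the operation name and typed messages from annotated R code."""
    spec = {"operation": None, "input": [], "output": [], "param": []}
    for line in r_source.splitlines():
        match = ANNOTATION.match(line.strip())
        if not match:
            continue  # ordinary R code or an ordinary comment
        kind, name, type_name = match.groups()
        if kind == "operation":
            spec["operation"] = name
        else:
            spec[kind].append((name, type_name))
    return spec


if __name__ == "__main__":
    print(parse_annotations(R_SOURCE))
```

Since the annotations are ordinary comments, the R script stays runnable on its own.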
Second, with the introduction of annotations, the
client-side Expert-R will act not only as an
interactive graphics tool but also as a console for
integration and a debugging environment. Finally,
more popular R packages and functions will be
integrated.
Acknowledgements
The authors would like to thank all other members
of the Discovery Net project. We thank Dr. Alex
Michie of Inforsense Ltd for helpful suggestions,
the BAIR project funded by the Wellcome Trust, and
the Discovery Net project funded under the UK
e-Science Programme.
References
1. Curcin V., Ghanem M., Guo Y., et al. (2004) SARS Analysis on the Grid.
http://www.allhands.org.uk/submissions/papers/80.pdf
2. JGR, http://stats.math.uni-augsburg.de/JGR/
3. Psarros M., Heberl S., Sick M., et al. (2005) RACE: Remote Analysis
Computation for gene Expression data. Nucleic Acids Res., 33 (Web Server
issue): W638-43.
4. Rowe A., Kalaitzopoulos D., Osmond M., et al. (2003) The discovery net
system for high throughput bioinformatics. Bioinformatics, 19 (supp):
i225-i231.
5. Syed J., Ghanem M., Guo Y. (2004) Discovery Processes in Discovery Net.
http://www.allhands.org.uk/submissions/papers/110.pdf
6. Uchida T., Nakamura T., Hashimoto N., et al. (2005) Deletion of Cdkn1b
ameliorates hyperglycemia by maintaining compensatory hyperinsulinemia in
diabetic mice. Nat Med., 11(2): 175-82.