Integrating R into Discovery Net System

Qiang Lu, Xinzhong Li
{qianglu, xinzhong}@doc.ic.ac.uk
Dept. of Computing, Imperial College, 180 Queens Gate, London, SW7 2RH, UK

Moustafa Ghanem, Yike Guo
{mmg, yg}@inforsense.com
459a Fulham Road, Chelsea, London SW10 9UZ, UK

Haiyan Pan
{hypan}@scbit.org
Shanghai Center for Bioinformation Technology, Shanghai, 210235, China

Abstract

The Discovery Net System is a workflow-based distributed computing environment that permits various tools to be integrated, and thus provides a high-performance platform for data mining and knowledge discovery. Bioinformatics is expanding rapidly, and hundreds of new algorithms and software packages are developed every year. Its integration capability makes the Discovery Net System well suited as a uniform platform for bioinformatics research, where comprehensive and systematic analysis is always needed. As an open-source statistical tool, R is becoming very popular in bioinformatics research, especially in microarray data analysis. Integrating R into the Discovery Net System is therefore of great value. In this paper, we present a framework with which R functions can be integrated into the Discovery Net System without any further programming. The methodology is described, and an application to microarray analysis is demonstrated.

1. Introduction

Bioinformatics is a rapidly growing research field in which thousands of papers are published every year, describing new problems and new strategies. It is very hard for any single software package to handle all these problems, especially with up-to-date methodologies. As a result, research centres and large pharmaceutical companies commonly run several academic or commercial software systems, often with overlapping functions, to support their daily data management and analysis.
Communication between these different systems then becomes a challenge. Discovery Net, one of EPSRC's six pilot projects (http://www.lesc.ic.ac.uk/), provides such an integrative analysis platform: the Discovery Net System, a workflow-based distributed computing environment that permits various tools to be integrated. The Discovery Net middleware is written in Java using J2EE. In the system, every distributed (local or remote) algorithm or function is incorporated as a pluggable component (a node). A Web Start enabled client provides a portal through which users access and manage the computation services. The portal provides functions for submitting tasks, retrieving results and managing workflows, as well as for interactive visualization. Rowe A. et al. (2003), Syed J. et al. (2004), and Curcin V. et al. (2004) have described the Discovery Net System in detail and demonstrated several applications in genome research.

Recently, the open-source program R (http://www.r-project.org/) has become very popular in bioinformatics research. It provides a wide variety of statistical algorithms (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, microarray analysis, and so on) and graphical facilities. Moreover, since R is free and highly extensible, many academic life science projects have contributed to it, quickly extending it to many related areas. R now covers not only its original subject of statistics but also a number of bioinformatics problems, for which it provides efficient solutions. Given R's powerful functionality, some commercial data analysis packages such as GeneSpring, Spotfire and Rosetta Resolver, and some open-source programs such as Gaggle, EBI Expression Profiler and RACE (Psarros M. et al. 2005), have already integrated R into their latest versions.
Here we introduce the integration of R into the Discovery Net System, together with its use in microarray expression data analysis. The most significant feature of this integration is a generic framework with which users can integrate R functions or subroutines themselves without further programming. An integration can be done in a couple of minutes.

2. Method

R is a language and environment with two significant capabilities: 1) statistical computing and 2) interactive graphics. With R in the Discovery Net System, both the statistical algorithms and the interactive graphical facilities are expected to be available.

2.1 Integration of Algorithms

In general, algorithms in R are written in the S language. R itself provides an engine to execute these scripts and a GUI to interact with users. To integrate its algorithms, i.e. to invoke R functions from the Discovery Net System, the R API is used, which is written in C and provided as a shared library. Since the Discovery Net System framework is built in Java, the Java Native Interface (JNI) is used to bridge the gap in the invocation stack, as shown in Figure 1.

Beyond the invocation stack, the more significant part is the generic framework, General-R, which facilitates the integration. The core of this framework is an XML file, i3xml (an XML implementation of Interface 3 of the WfMC Workflow Reference Model), which follows the format of the Web Services Description Language (WSDL) and describes R functions so that they fit the node specification of the Discovery Net System.

Figure 1. Architecture bridging Discovery Net and R, where r/w means read and write respectively.

Using WSDL, an R function is described with its input and output parameters as messages and its function name as an operation. With the support of XML Schema Definition (XSD), the types of input and output messages are flexible.
Basic types such as Double, Integer, String and Boolean can be used, as can complex types such as array and matrix. Moreover, to transfer raw datasets, a special data type, REXP (R EXPression), is introduced. Each operation is described in three parts, init, action and final, which are invoked by the R engine in succession. In the example in Table 1, the scripts init.R and final.R under the folder preprocessAffy/expresso are invoked before and after the action expresso(), respectively.

    <wsdl:operation name="Expresso">
      <kde:init script="preprocessAffy/expresso/init"/>
      <kde:final script="preprocessAffy/expresso/final"/>
      <kde:operation action="eset_tmp&lt;-expresso(abatch,bgcorrect.method=bgcorrect,normalize.method=normalize,pmcorrect.method=pmcorrect,summary.method=summary)"/>
    </wsdl:operation>

Table 1. Example of an operation, in which the Bioconductor function expresso is integrated.

Once the i3xml file describes the R functions, the General-R framework reads and interprets the description and invokes the R engine to execute the corresponding R scripts.

In the Discovery Net System, components such as those for algorithms are specified by input data/metadata, output data/metadata and parameters. A task run in the Discovery Net System is organized as a workflow: the input of a component comes from the previous component, and its output goes to the next. Parameters are entered before execution. To describe an R function as a Discovery Net System component, its inputs, outputs and parameters need to be defined; these correspond to the input and output messages defined in the i3xml file.

Editing the XML file in a text editor is tedious. For this reason, an editor for i3xml files, XML Wrapper, has been developed.
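The init, action and final parts of an operation can be illustrated with a short sketch. It is written in Python purely for illustration and is not the Java/JNI code of General-R. Several details are assumptions rather than facts from the system: the enclosing wsdl:definitions element, the placeholder namespace URI urn:example:kde, the hypothetical function name plan_operation, the use of R's source() to run a script, and the rule that ".R" is appended to the script attribute. The action string is shortened and written with R's `<-` assignment.

```python
import xml.etree.ElementTree as ET

# The WSDL 1.1 namespace URI is standard; the kde URI is a made-up placeholder.
NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",
    "kde": "urn:example:kde",  # hypothetical, not taken from the paper
}

# A cut-down i3xml fragment modelled on Table 1 (wrapper element assumed).
I3XML = """\
<wsdl:definitions xmlns:wsdl="http://schemas.xmlsoap.org/wsdl/"
                  xmlns:kde="urn:example:kde">
  <wsdl:operation name="Expresso">
    <kde:init script="preprocessAffy/expresso/init"/>
    <kde:final script="preprocessAffy/expresso/final"/>
    <kde:operation action="eset_tmp&lt;-expresso(abatch)"/>
  </wsdl:operation>
</wsdl:definitions>
"""


def plan_operation(xml_text, name):
    """Return the R statements for one operation in init, action, final order."""
    root = ET.fromstring(xml_text)
    for op in root.findall("wsdl:operation", NS):
        if op.get("name") == name:
            break
    else:
        raise KeyError("no operation named %r" % name)

    action = op.find("kde:operation", NS)
    if action is None:
        raise ValueError("operation %r has no action" % name)

    steps = []
    init = op.find("kde:init", NS)
    if init is not None:
        # Assumption: the script attribute names an .R file to be sourced.
        steps.append('source("%s.R")' % init.get("script"))
    steps.append(action.get("action"))
    final = op.find("kde:final", NS)
    if final is not None:
        steps.append('source("%s.R")' % final.get("script"))
    return steps


if __name__ == "__main__":
    for statement in plan_operation(I3XML, "Expresso"):
        print(statement)
```

The element order in the file does not matter here: the sketch always emits init first and final last, which mirrors the "in succession" behaviour described above. A real implementation would pass each statement to the R engine instead of printing it.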
2.2 Integration of Interactive Graphics

With the integration of algorithms described above, any graphical result created by an R function can be obtained by declaring the result with a graphical type such as JPGPicture, PostscriptPicture or PNGPicture. However, this approach requires properties of the graphics such as axes, colours and legends to be defined in advance, which is inconvenient, especially when exploring a dataset interactively.

Figure 2. Integration of interactive graphics.

To address this disadvantage, R is also integrated into the Discovery Net System client. To facilitate the integration, the third-party software JGR is adopted. JGR (Java GUI for R) includes an interactive interface for editing and entering R scripts. It uses a Java graphics device, JavaGD, in which all of R's painting functions are delegated to a Java class. After adding functions for communication between the Discovery Net System and R, such as loading data from and saving figures to the Discovery Net System, interactive graphics is implemented as the Expert-R node, as Figure 2 shows.

3. Application

The Wellcome Trust Functional Genomics Development Initiative funds a programme, the Biological Atlas of Insulin Resistance (BAIR, http://www.bair.org.uk), which aims to elucidate the mechanism of insulin resistance. The Discovery Net System was chosen as the main platform for its daily data analysis. Here we give an example, part of the BAIR research, of how IRS2 knockout mouse expression data are handled with the R integration. Figure 3 shows the analysis workflow.

The application uses 7 IRS2 knockout and 6 wildtype Affymetrix MOE430v2 chips. We chose the standard RMA normalization approach, and MAS5 present/absent calls were calculated as well. Data were log2 transformed. Welch's t-test was applied, followed by multiple testing correction to control the FDR. The predefined Bonferroni and Benjamini & Hochberg tests and Storey's FDR in the R multi-test package were performed.
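For readers unfamiliar with the multiple testing step, the Bonferroni and Benjamini & Hochberg adjustments can be sketched in a few lines. This is a plain Python illustration of the standard procedures, not the R code run in the workflow; the function names and the p-values in the usage example are made up for demonstration.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # illustrative values only
    fdr = benjamini_hochberg(raw)
    # Probesets "pass" at a chosen threshold, e.g. FDR < 0.05.
    passed = [i for i, q in enumerate(fdr) if q < 0.05]
    print(bonferroni(raw), fdr, passed)
```

A probeset passes at a given FDR level when its adjusted value falls below that level, which is how thresholds such as FDR < 0.05 or FDR < 0.25 are applied to a list of t-test p-values.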
After applying the Benjamini & Hochberg FDR test, we found that no probeset passed at FDR < 0.05, while only 9 probesets passed at FDR < 0.25. Among these 9 probesets, 3 are absent across all 13 chips, and one does not change enough (fold change = 1.03). The other 5 genes, Gpd2, Atp1b1, Pak1ip1 and Cdkn1b (about 30% down-regulated) and Dnpep (about 60% up-regulated), remain significantly differentially expressed between IRS2 knockout and wildtype mice.

Among these five genes, Cdkn1b, which encodes the protein p27(Kip1), regulates cell cycle progression in mammals by inhibiting the activity of cyclin-dependent kinases (CDKs). Other researchers (Uchida T. et al., 2005) have confirmed that deletion of Cdkn1b ameliorates hyperglycemia by maintaining compensatory hyperinsulinemia in diabetic mice; thus p27(Kip1) contributes to beta-cell failure during the development of type 2 diabetes in Irs2 knockout mice and represents a potential new target for the treatment of this condition.

Figure 3. Workflow with R integration for BAIR, where PreprocessAffy, ttest and FDR are nodes built with the General-R framework, while Expert-R is the integrated interactive R environment. ViewExpert-R shows the figures created. Other nodes, such as Boxplot, Filter, Derive and Join, are other Knowledge Discovery Network nodes.

Because almost no single gene passed the multiple comparison correction, it is necessary to identify groups of genes with similar regulation between IRS2 knockout and wildtype. First we obtained 586 probesets (representing 524 unique genes) with Welch's t-test P < 0.01. We then used Onto-Express (http://vortex.cs.wayne.edu/ontoexpress/) with multiple testing correction, which indicated that the RNA binding (P=0.0), endocytosis (P=5.0E-5), protein modification (P=3.3E-4), nucleus (P=5.4E-4), perinuclear region (P=6.8E-4), vesicle-mediated transport (P=0.002) and lipid biosynthesis (P=0.003) ontology groups were represented to a significantly greater extent in wildtype vs.
IRS2 knockout than would be expected if ontology groups were randomly distributed within the list of these 524 top genes.

4. Summary

As shown above, with the General-R framework we have developed, users can easily integrate R functions or subroutines into the Discovery Net System. So far, many R functions from Bioconductor packages have been integrated, which strengthens the Discovery Net System for microarray analysis. Several improvements are nevertheless proposed. Firstly, instead of using an XML editor for integration, an annotation-based method is proposed: annotations added to R code as comments define the input and output messages and the operations. The General-R framework will parse the R script to obtain the integration information, including message names, types and operations, and thus integrate the functions into the Discovery Net System automatically. Secondly, with the introduction of annotations, the client-side Expert-R will act not only as an interactive graphics tool but also as a console for integration and a debugging environment. Finally, more popular R packages and functions will be integrated.

Acknowledgements

The authors would like to thank all other members of the Discovery Net project. Thanks to Dr. Alex Michie of Inforsense Ltd for helpful suggestions. Thanks also to the BAIR project, funded by the Wellcome Trust, and the Discovery Net project, funded under the UK e-Science Programme.

References

1. Curcin V., Ghanem M., Guo Y., et al. (2004) SARS Analysis on the Grid. http://www.allhands.org.uk/submissions/papers/80.pdf
2. JGR, http://stats.math.uni-augsburg.de/JGR/
3. Psarros M., Heberl S., Sick M., et al. (2005) RACE: Remote Analysis Computation for gene Expression data. Nucleic Acids Res., 33 (Web Server issue): W638-43.
4. Rowe A., Kalaitzopoulos D., Osmond M., et al. (2003) The discovery net system for high throughput bioinformatics. Bioinformatics, 19 (suppl.), i225-i231.
5. Syed J., Ghanem M., and Guo Y. (2004) Discovery Processes in Discovery Net. http://www.allhands.org.uk/submissions/papers/110.pdf
6. Uchida T., Nakamura T., Hashimoto N., et al. (2005) Deletion of Cdkn1b ameliorates hyperglycemia by maintaining compensatory hyperinsulinemia in diabetic mice. Nat Med., 11(2):175-82.