Workflow Systems in Bioinformatics and the Bioinformatics Educational Grid

Tan Tin Wee
Associate Professor, National University of Singapore
tinwee@bic.nus.edu.sg

Shoba Ranganathan, Victor Tong, Justin Choo, Richard Tan, G.S. Ong, Simon See, T.S. Lim, Mark de Silva and K.S. Lim

International Symposium on Grid Computing ISGC2004, “Making the World Wide Grid a Reality”, 27 July 2004

In a Nutshell
Weaving together several threads of development in bioinformatics over the past five years or so, such as workflow integration and the DataGrid, to build an integrated educational grid “info”structure that will support HR development, education, training, self-learning, etc. in the emerging discipline of bioinformatics, for conventional as well as “e-” education.

Making the World Wide Grid a Reality: Contribution of Bioinformatics
• Bioinformatics is the science of using information and ICT to understand biology
• It is driven by rapid progress in allied disciplines of the “New Biology” (genomics, proteomics, metabolomics, transcriptomics, other ‘omics, computational biology, systems biology), which are generating unprecedented volumes of data
• Despite this, grid computing is not yet ubiquitous in the life sciences

In Vitro, In Vivo, In Situ, In Silico Biology and Personalised Medicine
• Imaging, Modeling, Simulation, Theoretical Biology
D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, 2000 (review).

“Tools have changed, but the job hasn’t”
Cartoons from a talk by Rozhan Mohammed Idrus & Hanafi Atan, Universiti Sains Malaysia, APAN 2003

Bioinformatics: emergent and almost pervasive in all biological and life science disciplines
Computational demands and data processing in the life sciences are expanding!
• ‘omics, Genomics, Proteomics, Bioinformatics, Computational Biology, Medical Informatics, BioStatistics
• LIFE SCIENCE INFORMATICS
• LIFE SCIENCES and HEALTH SCIENCES

Where does the Grid fit in?
• Life Science Informatics
• BIOTECHNOLOGY and NEW BIOLOGY
• INFOCOMMUNICATIONS TECHNOLOGY

Why no Grid here yet?
• Lack of widespread awareness and training in computational skills in the life sciences community
• Few computational, networking and grid computing experts with first-hand domain knowledge in life sciences
• Data-intensive nature of life science grid computing applications
• Labour-intensive nature of building life science grids
• Lack of killer applications
• Bioinformatics is a rapidly changing target

Biotech and InfoComm Technology - Parallel Growth
[Timeline figure, 1990 to 2004.
Biotechnology: Genome Project, Microbial Genomes, Worm Genome, Dolly & DNA chips, Human Genome, Systems Biology, BioX.
InfoCommunication Technology: Wais, Gopher, WWW boom, ISP, Java, Dotcom boom and crash, Internet 2, Grid Computing, Lambda Networking.]

Grids applied to Life Sciences
• Internet2 demos of the late 1990s
• Quasi-realtime data collection from synchrotrons for 3D structure determination
• iGrid98, SC’98, SC’99, SC2003 (most geographically dispersed grid computing award – arthropod phylogenetics)
• Anthrax research – United Devices
• Encyclopedia of Life (EOL)
• OBIGrid
• Kansai BioGrid
• Large-scale mega projects
• Not a World Wide Grid

iGrid98 SC’98
http://www.startap.net/startap/igrid98/maxLikeAnApbionet98.html
INET’99 demo
http://www.bic.nus.edu.sg/admin/News/Jun99/inet/inet99.html
http://www.startap.net/startap/APPLICATIONS/collabForStruct.html

When will the World Wide Grid be a reality for life sciences?
• Like the World Wide Web: everyone uses it, from publication to accessing the content
• Plug and play: tap computational cycles anywhere, from everywhere, anytime
• Secure to use
• Killer application like Mosaic in 1993
• Generate meaningful results
• Control key tools and automate mundane processes
• Connect people, computation, data, instruments

Focus on two key areas
• Grid-enabled bioinformatics workflow systems as the killer application
• Building a bioinformatics educational grid

Workflow integration
• 1996/7: Java-based FlowBot project
• 1998: Inet98 Internet Flowbot Protocol http://www.isoc.org/inet98/proceedings/8x/8x_1.htm
• 1998: Application to Life Sciences – Workflow Integration – BIC-CNPR joint project – Lim et al 1998 PSB’98, From Sequence to Structure to Literature: The protocol approach to Bioinformation
• Wu et al spinoff company GeneticXchange.com
• 2001: Spinoff company KOOPrime Pte Ltd
• 2002: BioWorldWideWorkFlow initiative in APBioNet
Workflow integration is the Killer Application for a World Wide Life Science Grid!

Bioinformatics Educational Grid
• 2001: S* Life Sciences Informatics Alliance – 3 years of experience in online bioinformatics education: 5 courses and >1000 persons worldwide trained in basic bioinformatics
• Team of Online Teaching Assistants
• Workshop on Education in Bioinformatics: WEB01, WEB02, WEB03, WEB04
• 2004: Problem Based Learning (PBL) in Bioinformatics online using emeet.nus.edu.sg
• 2004: Building the Bioinformatics Educational Grid
Education is the answer to making the World Wide Grid a reality

Background
Biologists and biotechnologists need to be equipped and trained to carry out tomorrow’s biological research today!
Integration of:
• Network Infrastructure
• Databases
• Software
• Computational Grid
• Online educational and teaching and learning materials
Education + Killer Application
1993

1.
Network Infrastructure
• APAN Advanced Research Network 1996-2004
• Internet2 and beyond
• 1st country outside North America to connect: SINGAREN – Singapore Advanced Research and Education Network
• TANET2 from Taiwan and APAN-Transpac were next. Then Abilene…
• … Today’s Starlight and Lambda networking

2. Databases
• Key major databases: 1.5 terabytes today!
• Publicly accessible data over the Internet doubling every 12 to 18 months
• http://www.bio-mirror.net/
• Mirroring Moore’s Law for chip technology

BIODATABASES
Genbank, Genbank Genomes, InterPro, PDB, BlastDB, BLOCKS, DDBJ, PIR, EMBL, PFAM, ENZYME, REBASE, PROSITE, NCBI REFSEQ, SRC, SWISSPROT, Taxonomy, TrEMBL, UniGene, euGenes

BioDataGrid: Registry of Databases
• NUS BioDataGrid initiative: everest.bic.nus.edu.sg/lsdb
• Singapore National Grid Office has a new initiative, to be announced soon
• Facilitate varying levels of granularity of access to structured and unstructured biological data

3. Software
• APBioBox project
• Funded by IDRC Pan Asia Networking R&D Grant
• Rapid and easy replication of grid-enabled software is crucial to grid growth

3. Software - APBioBox
• Funded by the International Development Research Centre (IDRC) of Canada, under its Pan Asia Networking (PAN) ICT grant
• To build an easily installable, widely and freely accessible, integrated suite of bioinformatics applications to facilitate training and research amongst biologists in developing countries
• A/P Tan Tin Wee, National University of Singapore
• Adjunct Professor Shoba Ranganathan, NUS, and Chair Professor, Macquarie University, Sydney
• Ong Guan Sin, Consultant Programmer, Singapore Computer Systems Pte Ltd

3. Software - APBioBox
• Shrink-wrapped bundle of some 300 software applications used in bioinformatics
• Preconfigured and integrated
• 15 minutes to install on a Linux Red Hat 9 platform, a setup that typically takes several weeks
• Partnered with Sun Microsystems to come up with Bio-Cluster Grid, the equivalent on the Sun Solaris platform
• CDROMs and downloadable: http://www.apbionet.org/apbiogrid/apbiobox

3. Software – APBioBox applications
Logical abstraction through Java wrappers built for:
• EMBOSS ~160 applications
• PHYLIP ~30 applications
• HMMER
• CLUSTALW
• BLAST
• FASTA, SSEARCH (in progress)
• MySQL
• SRS (Lion Bioscience)
• Globus Grid Toolkit 2.4
• Unix utilities
• KOOPlite
• Key bioinformatics databases (in progress)
• BioBox for Solaris Grid Engine Portal

4. Computational Grid
• APBioGrid 2002
• To facilitate the building of shared computational grid resources for the Asia Pacific region

APBioGRID Project
CRAY

APBioGrid
Aims to provide computational resources to bioinformaticians and biological researchers to facilitate education and research through sharing each other’s computers over the Grid

Why is the APBioNet Grid needed?
Large-scale [life] science [..] are done through the interaction of people, heterogeneous computing resources, information systems, and instruments, all of which are geographically and organizationally dispersed. The overall motivation for “Grids” is to facilitate the routine interactions of these resources in order to support large-scale [life] science […].
Altered from Bill Johnston, 27 July 01

Why the “Grid”?
• 1998: advent of Grid Computing – distributed computing
• E.g. tapping idle CPU cycles globally in the SETI project or the Anthrax online projects
• “Like tapping electrons from the power grid, just plug the appliance into the socket”
• Currently one of the hottest areas in ICT
So the basis for BioGrids has been laid

5.
Online Learning Material
Eight institutions from 5 continents since 2001 – the S* Life Science Informatics Alliance
• Sweden: University of Uppsala; Karolinska Institutet
• USA: Stanford University; University of California, San Diego
• Singapore: National University of Singapore
• Australia: University of Sydney; Macquarie University
• South Africa: University of the Western Cape

Sample Lecture - Slide View

Wide Range of S* learning materials
- Tutorial ppt presentation materials on introductory bioinformatics
- Frequently Asked Questions in Forum discussion archives
- Overview lectures on:
  Introductory Molecular Biology
  An Overview of the Computational Analysis of Biological Sequences
  Transcript Analysis and Reconstruction
  Comparative Genomics
  Representations and Algorithms for Computational Molecular Biology
  Protein Structure Primer, Structure Prediction and Protein Physics
  Genomics and Computational Molecular Biology
  Genomics
  Protein and Nucleic Acid Structure, Dynamics, and Engineering
  Proteomics and Proteomes
  Structure Prediction for Macromolecular Interactions
  Protein - Ligand Modeling
  Microarray informatics

Goals of S*
• Provide a GLObal Bioinformatics Unified Learning Environment (GLOBULE) made up of modular courses in the disciplines of bioinformatics, medical informatics and genomics
• Provide accessibility to the highest possible quality of online courseware approved by the educators from the host institutions.
• Develop an integrated modular learning environment that allows a student to select from both pre-requisite modules and advanced modules in order to build a comprehensive program.

S* course-3: by country
S* course: Growing List of Participants’ Countries

S* Geographical Comparison
[Bar chart: participants' distribution in the 1st course compared against the 3rd course, by region (Australasia, Asia, North America, Europe, Africa, South America); vertical axis 0 to 70]

Feedback
Pretty good.
A few rough edges but I'm sure you'll work them out over time. I really enjoyed it. Most of the lectures were very well presented and the participants in the forums helpful. I'm very impressed at the amount of work that has obviously gone into setting up the course.
~ Alan Wardroper, Thailand

The international participation of the lecturers and students. The relevance of the field of bioinformatics in meeting the biomedical needs of today. The level of communication provided by the IVLE system enhanced learning considerably. The range of professional and academic background of students. The technical support provided by SStar was rapid and efficient to queries.
~ C.A.O. IDOWU, England

Feedback
To think that a world-class, web based education with such valued lectures is brought to your desk free of cost is impossible elsewhere. The course was wonderfully well managed. Our requests and problems were quickly and well attended to. I had a great time doing this course and thank the S*STAR team whole heartedly for making me a fortunate participant with this fantastic experience.
~ Naidu Ratnala Thulaja, Singapore

I think it is a very useful course, it is exactly what it says it is: an introduction to bioinformatics. It covers nicely major topics and provides enough information in order for us to understand what bioinformatics is all about. I enjoyed it very much and I am even a bit sad it is over. Thank you very much!
~ Patricia Severino, Romania

Emergence of Grid Technologies
• “The Grid” - Grid Computing
• Next Generation Internet technologies (Internet2) and their applications
• Computational Grids
• Informational Grids
• Access Grid
• Educational Grids do the same for the educational process: the learner or the teacher can tap into learning materials, tools, information, computational hands-on, in the so-called classroom without walls!
Educational Grid for Bioinformatics
• Increase repository of regularly used bioinformatics software
• Registry of tools, software and databases
• Higher-level abstraction of resources
• Virtual classrooms and discussions
• Distributed repository of learning objects and materials
• Self-assessment tests
• Project-based modules
• Problem-Based Learning
• Integrated learning environment for the practice of bioinformatics in the life sciences
• Support both conventional and e-learning/e-education

Problem-Based Learning (PBL)
• Started at McMaster University Medical School over 25 years ago
• Encourages hands-on work and critical thinking. Its hands-on approach is particularly suited to bioinformatics, where many of the skills require practical execution and the problems encountered are generally open-ended.
• PBL encourages:
  - acquisition of critical knowledge
  - problem-solving proficiency; problems tackled are generally open-ended
  - self-motivated learning
  - team participation

Role Change
• In PBL, there is a fundamental change in the roles played by the participants:
  - a facilitator guides the entire session
  - a scribe records the entire session
  - some participants field questions; others brainstorm and provide answers
• There is no student-teacher relationship; everybody is treated equally
• Focus is on peer learning

PBL Asynchronous Sessions
• S* is currently experimenting with PBL sessions using the IVLE discussion forum and, eventually, a web-based collaboration platform, TWiki (http://twiki.org)
• Considerations/issues to resolve:
  - How to accommodate so many participants
  - How to host so many TWiki pages
  - Will participants with slow connections be able to access them?
PBL Synchronous Sessions
• emeet.nus.edu.sg
• CENTRA technology
• Low bandwidth requirement
• VoIP for voice, video if necessary
• Agenda, Whiteboard, Shared applications, File transfer, Web Safari

Projects
• 8 different projects
• 8 teams of volunteer facilitators
• 300 students in 8 groups
• Two phases
• Set them up to solve various topical bioinformatics problems from the bottom up in PBL style

Online Delivery Mechanism
• We want to explore various advanced networking technologies, particularly video conferencing software, e.g. AccessGrid™ http://www.accessgrid.org/

AccessGrid™
• It is a suite of resources including multimedia large-format displays, presentation and interactive environments, and interfaces to Grid middleware and to visualization environments.
• Developed by the Futures Laboratory at Argonne National Laboratory and deployed by the NCSA PACI Alliance, it is now used at over 150 institutions worldwide, with each institution hosting one or more Access Grid (AG) nodes.
• Each node employs the high-end audio and visual technology needed to provide a high-quality, compelling user experience.

Immersive Learning
• Enable group-to-group interactions across the Grid.
• Activities such as large-scale distributed meetings, collaborative work sessions, seminars, lectures, tutorials, and training are made possible.
Fig 1: Controlling Audio/Visual Quality
Fig 2: Group-to-Group Live Interaction

Issues & Considerations
• Infrastructure (high-speed network, connection/bandwidth)
• Cost of setting up
• Location of set-up
• Manpower required
• Technical competency

Workflow as the killer app
• KOOPrime’s LivePortal/LifeBase and KOOPlatform
• Carole Goble’s myGrid, Taverna, etc.
• Anabench
• Vibe
• Bingo
• All with killer GUI
• Others such as the ASP model – Bioinformatics.com/Entigen’s BioNavigator

KooP Testbed on APBioGRID
• Remote management of biosamples and distributed statistical analysis
[Diagram labels: Management, Microarray Database, Email, DB Update]

Implementation: Laboratory Integration
• Allows users to select vendor plates for processing
• Generate in-house plates from vendor plates
• Print barcodes for each selected plate
• Start up legacy dispenser software
• Auto-import output files of dispenser into database
• Email user if there is any error in processing
• Analysis of results and distributed computing
[Diagram labels: Email, DB Update]

Grid Based Workflows: A High-Level View
• Operator View, Administrator View
• Browse, Drag, Drop, Connect
• Scheduling functions
• Search and resource discovery functions
• Description, annotation and authoring functions
• Sharing/publishing/resource browsing functions

Future BioGRID Components
[Diagram of two parallel stacks of layers: Bio End User; Web Interface and KOOP Interface; Bio Applications (EMBOSS, PHYLIP, FASTA, SSEARCH); Globus-aware Scheduler (Nimrod-G); Globus; OS; CPU; Sun SGE; LSF]

Weaving the threads of development
• In Networking
• In Bioinformatics software application packages
• In BioDataGrid
• In online educational learning objects
Bioinformatics Educational Grid and Bio World Wide WorkFlow

Bio W3F
[Diagram: a workflow object takes the output from the previous object as its input; its output is the input to the next object]

Ingredients of W3F
• W3F orchestrator: KOOPserver
• W3F service providers
• Apps developers: KOOPsdk
• W3F enactors: KOOPdaemon
• W3F browser: KOOPbrowser
• W3F editor: KOOPeditor

• Users can browse workflows and cobble together app objects to reuse or repurpose objects or workflows
• Apps developers can wrap their applications and advertise them to potential users and service providers
• Service providers can mount apps from apps developers
• W3F orchestrator coordinates scheduling, load balancing, security, etc.

Why suitable for grid?
• KOOPdaemons can call grid commands through grid portals
• KOOPsdk easily wraps your existing applications, including grid ones
• KOOPsdk can also call grid commands of, say, the Globus grid toolkit
• Layered approach for rapid uptake

Framework of Bioinformatics Development in Asia Pacific from 1991-2004
[Diagram labels: POLICY; RESEARCH; EDUCATION & Manpower Training; COMPUTE INFRASTRUCTURE; DATA INFRASTRUCTURE; NETWORK INFRASTRUCTURE; Collaboration; Cooperation]

The Future
• Defining an evolving Educational Grid for bioinformatics
• Continuing major impact of ICT in the life sciences
• Synergistic and sustained growth of two major late 20th century technologies
• Building the framework for World Wide Workflow
• Share resources, access resources seamlessly
• Build sophisticated automated workflows comprising interconnection of people, computation, data and bioinstrumentation

Thank you for this opportunity to share this with you.
Tan Tin Wee
tinwee@bic.nus.edu.sg
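
Illustrative note on the Bio W3F idea: the slides describe workflow objects chained so that the output of one object is the input to the next. The Python sketch below shows only that chaining principle. It is not the KOOP or W3F API; the names `WorkflowObject` and `enact`, and the two toy sequence steps, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class WorkflowObject:
    """Hypothetical stand-in for a wrapped application in a workflow."""
    name: str
    run: Callable[[str], str]  # takes the previous output, returns its own output


def enact(workflow: List[WorkflowObject], first_input: str) -> str:
    """Run the objects in order, feeding each output to the next object."""
    data = first_input
    for obj in workflow:
        data = obj.run(data)
    return data


# Two toy steps standing in for real bioinformatics applications.
def reverse_complement(seq: str) -> str:
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_content(seq: str) -> str:
    gc = sum(1 for base in seq if base in "GC")
    return f"{gc / len(seq):.2f}"


workflow = [
    WorkflowObject("revcomp", reverse_complement),
    WorkflowObject("gc", gc_content),
]
result = enact(workflow, "ATGCGC")
print(result)
```

In a real deployment each `run` callable would presumably wrap an external tool or a grid job submission rather than an in-process function; the chaining logic would stay the same.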