Workflow Systems in
Bioinformatics and the
Bioinformatics Educational Grid
Tan Tin Wee
Associate Professor
National University of Singapore
tinwee@bic.nus.edu.sg
Shoba Ranganathan, Victor Tong, Justin Choo, Richard Tan,
G.S.Ong, Simon See, TS Lim, Mark de Silva and KSLim.
International Symposium on Grid Computing ISGC2004
“Making the World Wide Grid a Reality” 27 July 2004
In a Nutshell
 Weaving several threads of development in
Bioinformatics such as Workflow Integration
and DataGrid (over the past 5 yrs or so)
 to build an integrated educational grid “info”structure
 that will support HR development, education,
training, self-learning etc in the emerging
discipline of bioinformatics
 for conventional as well as “E-” education
Making the World Wide Grid
a reality: Contribution of
Bioinformatics
 Bioinformatics is the science of using
information and ICT to understand
biology
 It is driven by rapid progress in allied
disciplines in the “New Biology” (genomics,
proteomics, metabolomics, transcriptomics, other
‘omics, computational biology, systems biology),
which generate unprecedented volumes of data
 Despite this, Grid computing is not yet ubiquitous in the life sciences
[Diagram labels: In Vitro, In Vivo, In Situ, In Silico; Biology and Personalised Medicine; Imaging, Modeling, Simulation, Theoretical Biology]
D. Hanahan and R. A. Weinberg.
The hallmarks of cancer. Cell., 100(1):57–70 Review, 2000
“Tools have changed, but the job hasn’t”
Cartoons from talk by Rozhan Mohammed Idrus & Hanafi Atan, Universiti Sains
Malaysia, APAN 2003
Bioinformatics - Emergent and almost pervasive
in all biological and life science disciplines
Computational Demands and Data
Processing in Life Sciences
are expanding!
[Diagram labels: ‘omics, Genomics, Proteomics, Bioinformatics, Computational Biology, Medical Informatics, BioStatistics; LIFE SCIENCE INFORMATICS; LIFE SCIENCES and HEALTH SCIENCES]
Where does the Grid fit in?
Life Science
Informatics
BIOTECHNOLOGY
and NEW BIOLOGY
INFOCOMMUNICATIONS
TECHNOLOGY
Why no Grid here yet?
 Lack of widespread awareness and training in computational
skills in the life sciences community
 Few computational, networking and grid computing experts
with first hand domain knowledge in life sciences
 Data-intensive nature of life science grid computing
applications
 Labour-intensive nature of building life science grids
 Lack of Killer Applications
 Bioinformatics is a Rapidly changing target
Biotech and InfoComm Technology
- Parallel Growth
[Timeline figure; axis ticks: 1990, 92, 94, 96, 98, 2000, 2002, 2004]
Biotechnology labels: Genome Project, Microbial Genomes, Worm Genome, Dolly & DNA chips, Human Genome, Systems Biology, BioX
InfoCommunication Technology labels: Wais, Gopher, WWW boom, ISP, Java, Internet 2, Dotcom boom and crash, Grid Computing, Lambda Networking
Grids applied to Life Sciences
 Internet2 demos of late 1990s
Quasi-realtime data collection from
synchrotrons for 3D structure determination
 iGrid98, SC’98, SC’99, SC2003 (most
geographically dispersed grid computing
award – arthropod phylogenetics)
 Anthrax research – United Devices
 Encyclopedia of Life (EOL)
 OBIGrid
 Kansai BioGrid
  Large scale mega projects
  Not a WorldWide Grid
iGrid98
SC’98
http://www.startap.net/startap/igrid98/maxLikeAnApbionet98.html
INET’99
demo
http://www.bic.nus.edu.sg/admin/News/Jun99/inet/inet99.html
http://www.startap.net/startap/APPLICATIONS/collabForStruct.html
When will World Wide Grid
be a reality for Life sciences?
 Like the World Wide Web: everyone uses it,
from publishing content to accessing it
 Plug and play: Tap computational cycles
anywhere from everywhere anytime
 Secure to use
 Killer application like Mosaic in 1993
 Generate meaningful results
 Control key tools and automate mundane
processes
 Connect people, computation, data,
instruments
Focus on two key areas
 Grid-enabled bioinformatics workflow
systems as the killer application
 Building a bioinformatics educational
grid
Workflow integration
 1996/7 Java based FlowBot project
 1998 Inet98 Internet Flowbot Protocol
 http://www.isoc.org/inet98/proceedings/8x/8x_1.htm
 1998 Application to Life Sciences – Workflow
Integration – BIC-CNPR joint project – Lim et al
 1998 PSB’98 From Sequence to Structure to
Literature: The protocol approach to Bioinformation
Wu et al  spinoff company GeneticXchange.com
 2001 Spinoff Company KOOPrime Pte Ltd
 2002 BioWorldWideWorkFlow initiative in APBioNet
 Workflow integration is the Killer Application for a
World Wide Life Science Grid!
Bioinformatics Educational
Grid
 2001 - S* Life Sciences Informatics Alliance –
3 years of experience in online bioinformatics education:
5 courses and >1000 persons worldwide trained in
basic bioinformatics
Team of Online Teaching Assistants
 Workshop on Education in Bioinformatics: WEB01,
WEB02, WEB03, WEB04
 2004 – Problem Based Learning PBL
in Bioinformatics online using
emeet.nus.edu.sg
 2004 – Building the Bioinformatics Educational Grid
 Education is the answer to making the
World Wide Grid a reality
Background
 Biologists and Biotechnologists need to be
equipped and trained to carry out tomorrow’s
biological research today!
 Integration of
 Network Infrastructure
 Databases
 Software
 Computational Grid
 Online educational and teaching and learning
materials
 Education + Killer Application
1993
1. Network Infrastructure
APAN Advanced
Research Network 1996-2004
Internet2 and beyond
 1st Country outside North America to connect:
SINGAREN – Singapore Advanced
Research and Education Network
 TANET2 from Taiwan and APAN-Transpac
were next.
 Then Abilene….
 … Today’s Starlight and Lambda networking
2. Databases
 Key major databases - 1.5 Terabytes today!
 Publicly accessible data over the Internet doubling every
12 to 18 months http://www.bio-mirror.net/
 Mirroring Moore’s Law for chip technology
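The doubling claim above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming simple exponential growth from the 1.5 Terabyte figure quoted on the slide; the function name `projected_size_tb` and the three-year horizon are illustrative assumptions, not figures from the talk.

```python
def projected_size_tb(initial_tb: float, months: float, doubling_months: float) -> float:
    """Return the data volume after `months`, if it doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return initial_tb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    start_tb = 1.5  # size of the key major databases quoted on the slide
    # Compare the fast (12-month) and slow (18-month) doubling periods.
    for doubling in (12, 18):
        size = projected_size_tb(start_tb, 36, doubling)
        print(f"doubling every {doubling} months: {size:.1f} TB after 3 years")
```

Under these assumptions the two ends of the quoted range differ by a factor of two after only three years, which is why mirroring and storage planning need to track the faster estimate.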
BIODATABASES
Genbank
Genbank Genomes
InterPro
PDB
BlastDB
BLOCKS
DDBJ
PIR
EMBL
PFAM
ENZYME
REBASE
PROSITE NCBI REFSEQ
SRC
SWISSPROT
Taxonomy
TrEMBL
UniGene
euGenes
BioDataGrid: Registry of
Databases
 NUS BioDataGrid initiative
everest.bic.nus.edu.sg/lsdb
 Singapore National Grid Office has a
new initiative – to be announced soon.
 Facilitate varying levels of granularity
of access to structured and
unstructured biological data
3. Software
 APBioBox project
 Funded by IDRC Pan Asia Networking
R&D Grant
 Rapid and Easy Replication of Grid
enabled software crucial to grid growth
3. Software - APBioBox
 Funded by the International Development
Research Centre (IDRC) of Canada under its
Pan Asia Networking (PAN) ICT grant
 To build an easily installable, widely
and freely accessible, integrated suite
of bioinformatics applications to facilitate
training and research amongst biologists
in developing countries



A/P Tan Tin Wee,
National University of Singapore
Adjunct Professor Shoba Ranganathan, NUS and Chair Professor, Macquarie
University, Sydney
Ong Guan Sin, Consultant programmer, Singapore Computer Systems Pte Ltd
3. Software - APBioBox
 Shrink-wrapped bundle of some 300
software applications used in bioinformatics
 Preconfigured and integrated
 Installs in about 15 minutes on a Red Hat Linux 9
platform; the same setup done by hand typically takes
several weeks.
 Partnered with Sun Microsystems to produce
the Bio-Cluster Grid, the equivalent for the Sun
Solaris platform.
 CDROMs and Downloadable
http://www.apbionet.org/apbiogrid/apbiobox
3. Software – APBioBox applications
Logical Abstraction through Java Wrappers built for:
 EMBOSS ~160 applications
 PHYLIP ~30 applications
 HMMER
 CLUSTALW
 BLAST
 FASTA, SSEARCH (in progress)
 MySQL
 SRS (Lion Bioscience)
 Globus Grid Toolkit 2.4
 Unix Utilities
 KOOPlite
 Key Bioinformatics Databases (in progress)
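The idea of "logical abstraction through wrappers" can be sketched as follows. The APBioBox wrappers are written in Java; this Python sketch only illustrates the general pattern and is not the project's code. The class name `ToolWrapper`, its methods, and the `-key value` option style are all assumptions made for illustration, and the demo wraps a stand-in command (the Python interpreter itself) rather than any real bioinformatics tool.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class ToolWrapper:
    """Hypothetical uniform wrapper around one command-line program."""

    name: str
    executable: list[str]  # program plus any fixed leading arguments
    defaults: dict[str, str] = field(default_factory=dict)

    def build_command(self, **options: str) -> list[str]:
        """Merge default options with per-call options into an argv list.

        The "-key value" convention is an assumption; real tools differ.
        """
        merged = {**self.defaults, **options}
        argv = list(self.executable)
        for key, value in merged.items():
            argv += [f"-{key}", str(value)]
        return argv

    def run(self, stdin_text: str = "", **options: str) -> str:
        """Run the wrapped program and return its standard output."""
        result = subprocess.run(
            self.build_command(**options),
            input=stdin_text,
            capture_output=True,
            text=True,
            check=True,  # raise if the wrapped program exits with an error
        )
        return result.stdout


if __name__ == "__main__":
    # Stand-in "tool": a one-line Python program that upper-cases its input.
    upper = ToolWrapper(
        name="upper",
        executable=[sys.executable, "-c",
                    "import sys; sys.stdout.write(sys.stdin.read().upper())"],
    )
    print(upper.run(stdin_text="acgt"))
```

With every program exposed through the same `run` call, a workflow engine can treat applications as interchangeable building blocks regardless of their native command-line syntax.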
BioBox for Solaris
Grid Engine Portal
4. Computational Grid
APBioGrid 2002
To facilitate the building of shared computational grid resources for the
Asia Pacific region.
APBioGRID Project
CRAY
APBioGrid
Aims to provide computational resources to bioinformaticians and
biological researchers to facilitate education and research through
sharing each other’s computers over the Grid
Why is the APBioNet Grid needed?
 Large-scale [life] science [..] are done
through the interaction of people,
heterogeneous computing resources,
information systems, and instruments, all of
which are geographically and
organizationally dispersed.
 The overall motivation for “Grids” is to
facilitate the routine interactions of these
resources in order to support large-scale [life]
science […].
Altered from Bill Johnston 27 July 01
Why the “Grid”?
 1998: advent of Grid Computing – distributed
computing
 E.g. Tapping idle CPU cycles globally in the
SETI project or the Anthrax online projects.
 “Like tapping electrons from the power grid,
just plug in the appliance into the socket”
 Currently, one of the
hottest areas in ICT.
 So the basis for BioGrids
has been laid
5. Online Learning Material
Eight institutions from 5 continents since 2001
– The S* Life Science Informatics Alliance
Sweden: University of Uppsala; Karolinska Institutet
USA: Stanford University; University of California, San Diego
Singapore: National University of Singapore
Australia: University of Sydney; Macquarie University
South Africa: University of the Western Cape
Sample Lecture- Slide View
Wide Range of S* learning
materials
- Tutorial ppt presentation materials on introductory bioinformatics
- Frequently Asked Questions in Forum discussion archives
- Overview lectures on:
1. Introductory Molecular Biology
2. An Overview of the Computational Analysis of Biological Sequences
3. Transcript Analysis and Reconstruction
4. Comparative Genomics
5. Representations and Algorithms for Computational Molecular Biology
6. Protein Structure Primer, Structure Prediction and Protein Physics
7. Genomics and Computational Molecular Biology Genomics
8. Protein and Nucleic Acid Structure, Dynamics, and Engineering
9. Proteomics and Proteomes
10. Structure Prediction for Macromolecular Interactions
11. Protein - Ligand Modeling
12. Microarray informatics
Goals of S*
• Provide a GLObal Bioinformatics Unified Learning
Environment (GLOBULE) made up of modular
courses in the disciplines of bioinformatics, medical
informatics and genomics
• Provide accessibility to the highest possible quality
of online courseware approved by the educators
from the host institutions.
• Develop an integrated modular learning
environment that allows a student to select from
both pre-requisite modules and advanced modules
in order to build a comprehensive program.
S* course-3: by country
S* course Growing List of
Participants’ Countries
S* Geographical Comparison
Participants' Distribution in 1st Course Compared
with 3rd Course
[Bar chart: 1st Course vs 3rd Course for australasia, asia, north america, europe, africa and south america; vertical axis 0 to 70]
Feedback
 Pretty good. A few rough edges but I'm sure you'll work
them out over time. I really enjoyed it. Most of the lectures
were very well presented and the participants in the forums
helpful. I'm very impressed at the amount of work that has
obviously gone into setting up the course. ~ Alan
Wardroper, Thailand
 The international participation of the lecturers and students.
The relevance of the field of bioinformatics in meeting the
biomedical needs of today. The level of communication
provided by the IVLE system enhanced learning
considerably. The range of professional and academic
background of students. The technical support provided by
SStar was rapid and efficient to queries.
~ C.A.O. IDOWU, England
Feedback
 To think that a world-class, web based education with such
valued lectures is brought to your desk free of cost is
impossible elsewhere. The course was wonderfully well
managed. Our requests and problems were quickly and
well attended to. I had a great time doing this course and
thank the S*STAR team whole heartedly for making me a
fortunate participant with this fantastic experience.
~ Naidu Ratnala Thulaja, Singapore
 I think it is a very useful course, it is exactly what it says it
is: an introduction to bioinformatics. It covers nicely major
topics and provides enough information in order for us to
understand what bioinformatics is all about. I enjoyed it
very much and I am even a bit sad it is over. Thank you
very much! ~ Patricia Severino, Romania
Emergence of Grid
Technologies
 “The Grid” - Grid Computing
 Next Generation Internet technologies
(Internet2) and their applications
 Computational Grids
 Informational Grids
 Access Grid
 Educational Grids  do the same for the
educational process – the learner or the
teacher can tap into learning materials, tools,
information, computational hands-on, in the
so-called classroom without walls!
Educational Grid for
Bioinformatics
 Increase repository of regularly used bioinformatics software
 Registry of tools, software and databases
 Higher level abstraction of resources
 Virtual classrooms and discussions
 Distributed repository of learning objects and materials
 Self assessment tests
 Project Based modules
 Problem Based Learning
 Integrated learning environment for the practice of
bioinformatics in the life sciences
 Support both conventional and e-learning/e-education
Problem-Based Learning
(PBL)
• Started at McMaster University Medical School over
25 years ago
• Encourages hands-on work and critical thinking. Its hands-on approach is particularly suited to bioinformatics,
where many of the skills require practical execution
and the problems encountered are generally open-ended.
• PBL encourages :
acquisition of critical knowledge.
problem solving proficiency; problems tackled are generally
open-ended.
self-motivated learning.
team participation.
Role Change
• In PBL, there’s a fundamental change
in the role played by the participants.
a facilitator guides the entire session.
a scribe records the entire session.
some participants field questions; others
try to brainstorm and provide answers.
There is no student-teacher
relationship; everybody is treated equally.
Focus is on peer learning
PBL Asynchronous Sessions
• S* is currently experimenting with PBL sessions
using the IVLE discussion forum and, eventually, a
web-based collaboration platform: TWiki
(http://twiki.org)
• Considerations/issues to resolve:
 How to accommodate so many participants?
 How to host so many TWiki pages?
 Will participants with slow connections be able to
access them?
PBL synchronous sessions
 Emeet.nus.edu.sg
 CENTRA technology
 Low bandwidth requirement
 VOIP for voice, Video if necessary
 Agenda, Whiteboard, Shared
applications, File transfer, Web Safari
Projects
 8 different projects
 8 teams of volunteer facilitators
 300 students into 8 groups
 Two phases
 Set them up to solve various topical
bioinformatics problems from bottom
up in PBL style.
Online Delivery Mechanism
• We are considering, and want to explore, various
advanced networking technologies,
particularly video conferencing software.
 e.g. AccessGrid™
http://www.accessgrid.org/
AccessGrid™
• It is a suite of resources including multimedia large-format displays, presentation and interactive
environments, and interfaces to Grid middleware and to
visualization environments.
• Developed by the Futures Laboratory at Argonne
National Laboratory and deployed by the NCSA PACI
Alliance, it is now used at over 150 institutions worldwide,
with each institution hosting one or more Access Grid
(AG) nodes.
• Each node employs the high-end audio and visual
technology needed to provide a high-quality, compelling
user experience.
Immersive Learning
 Enable group-to-group
interactions across the Grid.
 Activities such as large-scale
distributed meetings,
collaborative work sessions,
seminars, lectures, tutorials,
and training are made
possible.
Fig 1: Controlling Audio/Visual Quality
Fig 2: Group-to-Group Live Interaction
Issues & Considerations
 Infrastructure (high speed network,
connection/bandwidth)
 Cost of setting up
 Location of set-up
 Manpower required
 Technical competency
Workflow as the killer app
 KOOPrime’s LivePortal/LifeBase and
KOOPlatform
 Carole Goble’s myGrid, Taverna, etc
 Anabench
 Vibe
 Bingo
 All with killer GUI
 Others such as ASP model –
Bioinformatics.com/Entigen’s BioNavigator
KooP Testbed on APBioGRID
Management Microarray
Database
Email
DB
Update
Remote Management of
biosamples and distributed
statistical analysis
Implementation:
Laboratory Integration
 Allows users to select vendor plates for processing
 Generate in-house plates from vendor plates
 Print barcodes for each selected plate
 Start up legacy dispenser software
 Auto-import output files of dispenser into database
 Email user if there is any error in processing
[Diagram labels: Email, DB Update]
Analysis of Results and
Distributed Computing
Grid Based Workflows:
A High-Level View
Operator View
Administrator View
Browse Drag Drop Connect
Scheduling Functions
Search and
Resource
Discovery
functions
Description
Annotation
Authoring
Function
Sharing/Publishing/
Resource Browsing
functions
Future BioGRID
Components
[Layered diagram; labels in order of appearance]
Bio End User
Web Interface, KOOP Interface
Bio Applications (EMBOSS, PHYLIP, FASTA, SSEARCH); Bio Applications (EMBOSS, PHYLIP)
Globus-aware Scheduler (Nimrod-G)
Globus
OS
CPU
Sun SGE, LSF
Weaving the threads of
development
 In Networking
 In Bioinformatics Software application
packages
 In BioDataGrid
 In Online Educational Learning Objects
 Bioinformatics Educational Grid and
Bio World Wide WorkFlow Bio W3F
Output from previous object
Workflow Object
Output is input to the next object
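The chaining rule on this slide, where each object's output becomes the next object's input, can be sketched in a few lines. This is an illustrative sketch only, not the KOOP or W3F implementation: the function `run_workflow` and the three stand-in steps are hypothetical names chosen for the example, and every step is assumed to take and return plain text.

```python
from functools import reduce
from typing import Callable, Iterable

Step = Callable[[str], str]  # assumed interface: text in, text out


def run_workflow(steps: Iterable[Step], initial_input: str) -> str:
    """Feed each step's output to the next step as its input."""
    return reduce(lambda data, step: step(data), steps, initial_input)


def strip_header(fasta: str) -> str:
    """Drop FASTA header lines and join the sequence lines."""
    return "".join(
        line.strip() for line in fasta.splitlines() if not line.startswith(">")
    )


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def gc_report(seq: str) -> str:
    """Summarise the fraction of G and C bases as text."""
    gc = sum(base in "GC" for base in seq)
    return f"GC={gc / len(seq):.2f}"


if __name__ == "__main__":
    record = ">demo\nAACG\nGT\n"
    print(run_workflow([strip_header, reverse_complement, gc_report], record))
```

Because every step shares one input/output shape, steps can be reordered, reused in other workflows, or replaced by remote services without changing the chaining logic.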
Ingredients of W3F
W3F orchestrator
KOOPserver
W3F service providers
Apps developers
KOOPsdk
W3F enactors
KOOPdaemon
W3F browser
KOOPbrowser
W3F editor
KOOPeditor
• Users can browse workflows and cobble together
app objects to reuse or repurpose objects or workflows
• Apps developers can wrap their applications and
advertise them to potential users and service providers
• Service providers can mount apps from apps developers
• W3F orchestrator coordinates scheduling, load balancing, security etc
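The division of roles above (developers advertise apps, service providers mount them, the orchestrator schedules requests) can be sketched as a small in-memory registry. Everything here is an assumption made for illustration: the class `Orchestrator`, its method names, and the round-robin policy standing in for "load balancing" are not taken from KOOPserver, and security is left out entirely.

```python
from itertools import cycle
from typing import Callable, Dict, Iterator, List, Tuple

App = Callable[[str], str]  # assumed app interface: text in, text out


class Orchestrator:
    """Hypothetical registry that tracks apps and the providers hosting them."""

    def __init__(self) -> None:
        self._catalogue: Dict[str, App] = {}
        self._providers: Dict[str, List[str]] = {}
        self._rotation: Dict[str, Iterator[str]] = {}

    def advertise(self, app_name: str, app: App) -> None:
        """Called by an apps developer to publish a wrapped application."""
        self._catalogue[app_name] = app
        self._providers.setdefault(app_name, [])

    def mount(self, provider: str, app_name: str) -> None:
        """Called by a service provider to host an advertised app."""
        if app_name not in self._catalogue:
            raise KeyError(f"{app_name!r} has not been advertised")
        self._providers[app_name].append(provider)
        # Restart the round-robin rotation so the new provider is included.
        self._rotation[app_name] = cycle(tuple(self._providers[app_name]))

    def dispatch(self, app_name: str, data: str) -> Tuple[str, str]:
        """Pick the next provider in turn and run the app on `data`."""
        if app_name not in self._rotation:
            raise LookupError(f"no provider has mounted {app_name!r}")
        provider = next(self._rotation[app_name])
        return provider, self._catalogue[app_name](data)


if __name__ == "__main__":
    orch = Orchestrator()
    orch.advertise("upper", str.upper)
    orch.mount("node-a", "upper")
    orch.mount("node-b", "upper")
    for _ in range(3):
        print(orch.dispatch("upper", "acgt"))
```

In this sketch the app runs locally and the provider name is only a label; a real enactor would forward the request to the chosen provider's machine.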
Why suitable for grid?
 KOOPdaemons can call grid
commands through grid portals
 KOOPsdk easily wraps your existing
applications, including grid ones
 KOOPsdk can also call grid commands
of say Globus grid toolkits
 Layered approach for rapid uptake.
Framework of Bioinformatics
Development in Asia Pacific
from 1991-2004
POLICY
RESEARCH
EDUCATION &
Manpower Training
Compute INFRASTRUCTURE
DATA INFRASTRUCTURE
NETWORK INFRASTRUCTURE
Collaboration Cooperation
The Future
 Defining an Evolving Educational Grid for bioinformatics
 Continuing Major Impact of ICT in the Life Sciences
 Synergistic and sustained growth of two major late 20th
Century technologies
 Building the framework for World Wide Workflow
 Share resources, access resources seamlessly
 Build sophisticated automated workflows comprising interconnection of
people, computation, data and bioinstrumentation
 Thank you for this opportunity to share this with
you.
 Tan Tin Wee
 Tinwee@bic.nus.edu.sg