BioMashups: The New World of Exploratory Bioinformatics? Jiro Sumitomo Felicity Newell

BioMashups: The New World
of Exploratory Bioinformatics?
Jiro Sumitomo, James M. Hogan,
Felicity Newell, Paul Roe
Microsoft QUT eResearch Centre
j.hogan@qut.edu.au
Bioinformatics
 Tools, Data, and linking them together
 Exploration vs. Routine Workflow
Mashups and BioMashups
 Some basics and some canonical examples
 Biomashups and their limitations
Predictin’ the future
Smart BioTools
Bioinformatics
Abundance of tools and data sources
 Traditional standalone applications
 Interactive web sites
 (More recently) web service hooks
Usually purpose-specific tools
 Link together to solve complex problems
Linking tools together
The workflow trade-off:
 Sophistication vs development effort
 Keep it simple, and keep the scientist involved
 Make it complex & make the scientist a client
Bench scientists usually aren’t software engineers
 But they can chain operations together if they have the right primitives and the right glue
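As a rough illustration of "primitives and glue", the sketch below chains small single-purpose functions with one generic helper. The `chain` helper and the cleaning steps are hypothetical examples, not part of any framework named in these slides.

```python
from functools import reduce

def chain(*steps):
    """Glue: return a function that feeds a value through each step in order."""
    def run(value):
        return reduce(lambda acc, step: step(acc), steps, value)
    return run

# Primitives: each one does a single small job on a sequence string.
clean_sequence = chain(
    str.strip,                    # drop surrounding whitespace
    str.upper,                    # normalise case
    lambda s: s.replace(" ", ""), # remove internal spaces
)

print(clean_sequence("  at gc "))
```

The scientist only decides the order of the steps; the glue stays the same for every pipeline.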
Extremes of Scientific Workflow
The manual data management system
 Also known as cut-and-paste from Excel
 Cannot scale, but it presents no barriers…
Robust Workflow Systems: Taverna, Kepler et al.
 Essential for high-end instrumentation; well-engineered, support for provenance
 But significant set-up, familiarisation…
The Middle Ground…
Scripting in Perl, Python et al.
 Significant programming skills needed
 Useful for well-defined processes, but exploratory work is time consuming
 Accessing remote data and linking web services is beyond most scientists
 [A niche for biomashups?]
Mashups
Mashups are web-based applications that combine data sources and services
Earliest mashups used JavaScript to link exposed service and data APIs, and to wrap existing tools
 Same issues as Perl scripting, with the additional need to organise hosting
 Little incentive to standardise or share
Mashup Frameworks
Development environments, hosting and publication
 Common interface structure
 Building a community?
Scripting for scientists?
 Overcoming the programming barrier
 Depends on the libraries, primitive ops
 And there is (usually) JavaScript under the hood
Some of the players…
Mashups & Data
Mashups are limited by data exchange
 Good at passing an index to the data
 Think latitude & longitude
 Bad at passing massive data sets around
[Figure: client mashup architecture. Labels: client web browser, mashup, mashup server, third-party services. Examples named: Facebook, Virtual Earth.]
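The index-versus-data point can be sketched in a few lines of Python. Everything here is assumed for illustration: `REMOTE_RECORDS` stands in for a remote data service, and `EXAMPLE_ID` is a made-up identifier rather than a real accession.

```python
import json

# Hypothetical stand-in for a remote data service; a real mashup would call a web API.
REMOTE_RECORDS = {
    "EXAMPLE_ID": {"name": "example protein", "sequence": "M" + "A" * 5000},
}

def message_with_index(identifier):
    """Message a block emits when it passes only a key to the data."""
    return json.dumps({"id": identifier})

def message_with_payload(identifier):
    """Message a block emits when it ships the whole record downstream."""
    return json.dumps({"id": identifier, "record": REMOTE_RECORDS[identifier]})

def resolve(message):
    """Downstream block: fetch the record from its index only when it is needed."""
    return REMOTE_RECORDS[json.loads(message)["id"]]

small = message_with_index("EXAMPLE_ID")
large = message_with_payload("EXAMPLE_ID")
print(len(small), "characters versus", len(large), "characters")
```

The short message carries the same information as long as every block can reach the data service, which is why index passing suits mashups and bulk transfer does not.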
BioMashups
Middle ground between cut-and-paste and full workflow management systems
 Corresponds best to Perl scripting
 Ideal when user intervention is needed
 May be seen as a prototype for Workflow
 Helps to mask complex data access and search tools which frustrate experts and drive students to exasperation…
SDLM1
Perform a blastx on the sequence.
Obtain the best hit/hits by inspection of the BLAST output page.
Retrieve the GenBank record of the best hit by clicking on the link in the output page.
Determine the known regions by inspection, in this case an ANF_receptor.
Perform an Entrez search on this region.
The New UG Biology: SDLM1
Perform a blastx on the sequence. (NCBI Blast block)
Obtain the best hit/hits by inspection of the BLAST output page. (NCBI Blast result parser block)
Retrieve the GenBank record of the best hit by clicking on the link in the output page. (RDF block, pointing to Bio2RDF)
Determine the known regions by inspection, in this case an ANF_receptor. (The mashup parses the RDF document instead: Bio2RDF block)
Perform an Entrez search on this region. (NCBI Entrez block)
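The block chain above might look roughly like this in a script. This is a sketch under several assumptions: it swaps in the NCBI E-utilities endpoints (`efetch.fcgi`, `esearch.fcgi`) in place of the Bio2RDF block described on the slide, it only builds request URLs rather than calling the services, and the hit list and accessions are made up to stand in for parsed blastx output.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def best_hit(hits):
    """Pick the hit with the lowest E-value (replaces inspecting the BLAST page)."""
    return min(hits, key=lambda hit: hit["evalue"])

def record_url(accession):
    """URL for a flat-file protein record (replaces clicking the link)."""
    params = {"db": "protein", "id": accession, "rettype": "gp", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

def entrez_search_url(term, db="protein"):
    """URL for an Entrez search on a region name."""
    return f"{EUTILS}/esearch.fcgi?{urlencode({'db': db, 'term': term})}"

# Made-up hits; a real run would parse these from the blastx result.
hits = [
    {"accession": "HYPOTHETICAL_1", "evalue": 1e-5},
    {"accession": "HYPOTHETICAL_2", "evalue": 3e-40},
]

top = best_hit(hits)
fetch_url = record_url(top["accession"])
search_url = entrez_search_url("ANF_receptor")
print(fetch_url)
print(search_url)
```

Each function corresponds to one block, so the student's manual click-through becomes a fixed sequence of calls while the judgement step (which hit is best) stays visible and replaceable.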
Case Study: Analysing Proteins
Protein Characteristics
 Name, sequence
 Journal articles, cross-reference
Protein Prediction
 Molecular weight, isoelectric point
 Secondary structure, post-translational mods
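Of the predicted properties, molecular weight is the simplest to illustrate. The sketch below is not the prediction service used in the case study; it assumes the commonly tabulated average residue masses (in daltons) and adds one water molecule for the free termini, so values should be checked against a trusted reference before use.

```python
# Average residue masses in daltons (commonly tabulated values; an assumption here).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.0153  # one water is added for the N- and C-terminal groups

def molecular_weight(sequence):
    """Estimate the average molecular weight of an unmodified peptide chain."""
    residues = sequence.strip().upper()
    unknown = set(residues) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    if not residues:
        return 0.0
    return sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER_MASS

print(round(molecular_weight("GA"), 3))
```

Isoelectric point follows the same pattern but depends on a chosen set of pKa values, which is one reason the slides delegate it to an existing prediction service.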
Data & Services
Mashups Architecture
13 Custom Blocks
1) Input and Output
2) Processing: protein characteristics
3) Processing: protein prediction
[Diagram: Input feeds Protein Characteristics and Protein Prediction; Combine merges their results; Output displays them]
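The input, processing, combine and output stages can be mimicked with ordinary functions. This is a toy sketch, not the 13 custom blocks from the case study: every function name is hypothetical, and the two processing blocks compute trivial local values where the real blocks would call web services.

```python
def input_block(raw):
    """Input: normalise the user's sequence."""
    return raw.strip().upper()

def characteristics_block(sequence):
    """Processing (characteristics): placeholder for a database lookup."""
    return {"length": len(sequence)}

def prediction_block(sequence):
    """Processing (prediction): placeholder for a prediction service."""
    if not sequence:
        return {"hydrophobic_fraction": 0.0}
    hydrophobic = sum(sequence.count(residue) for residue in "AILMFVW")
    return {"hydrophobic_fraction": hydrophobic / len(sequence)}

def combine_block(*parts):
    """Combine: merge the dictionaries produced by the processing blocks."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

def output_block(result):
    """Output: render the merged result as text, one property per line."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(result.items()))

def run_mashup(raw):
    sequence = input_block(raw)
    combined = combine_block(characteristics_block(sequence), prediction_block(sequence))
    return output_block(combined)

print(run_mashup(" maiv "))
```

Because the two processing branches share only the input and the combine step, either can be replaced or extended without touching the other.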
BioMashups for Proteins
Given its UniProt ID, how much can we find out about a particular protein?
BioMashups for Proteins
Given its sequence, what properties can we readily obtain from web-based prediction services?
Predictin’ is difficult… but
Frameworks can and will support
 Ad hoc exploratory bioinformatics
 Index-based routine computation
 Building (enclave) communities
Varying levels of success in allowing
 Scientist (& student) driven mashups
 Sharing and re-use of components
Predictin’ is difficult… but
It will be a long time before mashup frameworks:
 Are used to process data from high-throughput sequencing machines
 Process large scale collections
 Beat Taverna & Kepler at provenance
Predictin’ is difficult… BUT
Overcoming the barriers…
Building a general BioMashups community
 Cross-over between frameworks
 Seeding the community with ‘re-usable’ components and reaching critical mass
 The myExperiment BioMashups group
Bringing BioMashups to the curriculum
 The new undergraduate biology
Links
MQUTeR Bio & BioMashups
 http://www.mquter.qut.edu.au/bio/
 http://www.mquter.qut.edu.au/bio/biomashups.aspx
myExperiment BioMashups Group
 http://www.myexperiment.org/groups/99
Protein Mashups
 http://www.mquter.qut.edu.au/bio/ProteinMashupsb[1].wmv
 http://www.popfly.com/users/fsn/Protein%20Biomashups%20Summary%20page
Acknowledgements
Questions?