BioMashups: The New World of Exploratory Bioinformatics?
Jiro Sumitomo, James M. Hogan, Felicity Newell, Paul Roe
Microsoft QUT eResearch Centre
j.hogan@qut.edu.au

- Bioinformatics
  - Tools, Data, and linking them together
  - Exploration vs. Routine Workflow
- Mashups and BioMashups
  - Some basics and some canonical examples
  - Biomashups and their limitations
- Predictin’ the future

Smart BioTools

Bioinformatics
- Abundance of tools and data sources
  - Traditional standalone applications
  - Interactive web sites
  - (More recently) web service hooks
- Usually purpose-specific tools
- Link together to solve complex problems

Linking tools together
- The workflow trade-off: Sophistication vs development effort
  - Keep it simple, and keep the scientist involved
  - Make it complex & make the scientist a client
- Bench scientists usually aren’t software engineers
- But they can chain operations together if they have the right primitives and the right glue

Extremes of Scientific Workflow
- The manual data management system
  - Also known as cut-and-paste from Excel
  - Cannot scale, but it presents no barriers…
- Robust Workflow Systems: Taverna, Kepler et al.
  - Essential for high-end instrumentation; well-engineered, support for provenance
  - But significant set-up, familiarisation…

The Middle Ground…
- Scripting in Perl, Python et al.
  - Significant programming skills needed
  - Useful for well-defined processes, but exploratory work is time consuming
  - Accessing remote data and linking web services beyond most scientists
- [A niche for biomashups?]

Mashups
- Mashups are web-based applications for the combination of data sources and services
- Earliest mashups used JavaScript to link exposed service and data APIs, and to wrap existing tools
- Same issues as Perl scripting, with the additional need to organise hosting
- Little incentive to standardise or share

Mashup Frameworks
- Development environments, hosting and publication
- Common interface structure
- Building a community?
- Scripting for scientists?
  - Overcoming the programming barrier
  - Depends on the libraries, primitive ops
  - And there is (usually) JavaScript under the hood

Some of the players…

Mashups & Data
- Mashups are limited by data exchange
- Good at passing an index to the data
  - Think latitude & longitude
- Bad at passing massive data sets around
[Diagram: client mashup architecture, e.g. Facebook. Labels: Client web browser, Mashup, Mashup Server, Third Party Services (e.g. Virtual Earth), ...]

BioMashups
- Middle ground between cut-and-paste and full workflow management systems
- Corresponds best to Perl scripting
- Ideal when user intervention is needed
- May be seen as a prototype for Workflow
- Helps to mask complex data access and search tools which frustrate experts and drive students to exasperation…

SDLM1
1. Perform a blastx on the sequence.
2. Obtain the best hit/hits by inspection of the blast output page.
3. Retrieve Genbank record of the best hit by clicking on the link in the output page.
4. Determine the known regions by inspection, in this case an ANF_receptor.
5. Perform an Entrez search on this region.

The New UG Biology: SDLM1
1. Perform a blastx on the sequence.
   (NCBI Blast block)
2. Obtain the best hit/hits by inspection of the blast output page.
   (NCBI Blast result parser block)
3. Retrieve Genbank record of the best hit by clicking on the link in the output page.
   (RDF Block, pointing to Bio2Rdf)
4. Determine the known regions by inspection, in this case an ANF_receptor.
   (The mashup parses the RDF document instead - Bio2Rdf Block)
5. Perform an Entrez search on this region.
   (NCBI Entrez block)

Case Study: Analysing Proteins
- Protein Characteristics
  - Name, sequence
  - Journal articles, cross-reference
- Protein Prediction
  - Molecular weight, isoelectric point
  - Secondary structure, post-translational mods

Data & Services

Mashups Architecture
- 13 Custom Blocks
  1) Input and Output
  2) Processing: protein characteristics
  3) Processing: protein prediction
[Diagram labels: Input, Protein Characteristics, Protein Prediction, Combine, Output]

BioMashups for Proteins
- Given its Uniprot ID, how much can we find out about a particular protein?

BioMashups for Proteins
- Given its sequence, what properties can we readily obtain from web-based prediction services?

Predictin’ is difficult… but
- Frameworks can and will support
  - Ad hoc exploratory bioinformatics
  - Index-based routine computation
  - Building (enclave) communities
- Varying levels of success in allowing
  - Scientist (& student) driven mashups
  - Sharing and re-use of components

Predictin’ is difficult… but
- It will be a long time before mashup frameworks:
  - Are used to process data from high-throughput sequencing machines
  - Process large scale collections
  - Beat Taverna & Kepler at provenance

Predictin’ is difficult… BUT

Overcoming the barriers…
- Building a general BioMashups community
  - Cross-over between frameworks
  - Seeding the community with ‘re-usable’ components and reaching critical mass
  - The myExperiment BioMashups group
- Bringing BioMashups to the curriculum
  - The new undergraduate biology

Links
- MQUTeR Bio & BioMashups
  - http://www.mquter.qut.edu.au/bio/
  - http://www.mquter.qut.edu.au/bio/biomashups.aspx
- myExperiment BioMashups Group
  - http://www.myexperiment.org/groups/99
- Protein Mashups
  - http://www.mquter.qut.edu.au/bio/ProteinMashupsb[1].wmv
  - http://www.popfly.com/users/fsn/Protein%20Biomashups%20Summary%20page

Acknowledgements

Questions?
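
Backup: the block architecture as code (a sketch)

The protein case study wires blocks together: an Input, two Processing blocks (protein characteristics and protein prediction), a Combine and an Output. The sketch below is one possible reading of that wiring in plain Python, not the mashup framework's own blocks. Every function name, the `FAKE_LOOKUP` table and the ID `X00001` are hypothetical, and both processing blocks are offline placeholders: the real blocks call remote data and prediction services, whereas here the characteristics come from a made-up table and the "prediction" is just residue counting.

```python
from collections import Counter
from typing import Dict

Record = Dict[str, object]

# Hypothetical offline table standing in for a remote lookup service.
# The ID and the values are made up for illustration only.
FAKE_LOOKUP: Dict[str, Record] = {
    "X00001": {"name": "Example protein", "references": 2},
}


def input_block(uniprot_id: str, sequence: str) -> Record:
    """Input: normalise what the user typed."""
    return {"id": uniprot_id.strip().upper(), "sequence": sequence.strip().upper()}


def characteristics_block(record: Record) -> Record:
    """Processing 1: look up descriptive facts by ID (stubbed locally)."""
    found = FAKE_LOOKUP.get(str(record["id"]), {})
    return {
        "name": found.get("name", "unknown"),
        "references": found.get("references", 0),
    }


def prediction_block(record: Record) -> Record:
    """Processing 2: derive properties from the sequence (placeholder only)."""
    sequence = str(record["sequence"])
    return {"length": len(sequence), "composition": dict(Counter(sequence))}


def combine_block(*parts: Record) -> Record:
    """Combine: merge the outputs of the upstream blocks into one record."""
    merged: Record = {}
    for part in parts:
        merged.update(part)
    return merged


def output_block(record: Record) -> str:
    """Output: render the combined record as 'key: value' lines."""
    return "\n".join(f"{key}: {record[key]}" for key in sorted(record))


def run_mashup(uniprot_id: str, sequence: str) -> Record:
    """Feed the input to both processing blocks, then combine the results."""
    start = input_block(uniprot_id, sequence)
    return combine_block(
        start, characteristics_block(start), prediction_block(start)
    )


if __name__ == "__main__":
    print(output_block(run_mashup(" x00001 ", "mkta")))
```

Only small values (an ID, a short sequence, a few derived fields) pass between blocks, in line with the earlier point that mashups are good at passing an index to the data and bad at passing massive data sets around.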