Motivation: modernizing (life) science

Homo biologicus in his natural habitat

A mailing list for the Homo biologicus

Sharing a calendar for the Homo biologicus

/*
 * determines ridges in htm expression table
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>
#include "ridge.h"   /* assumed to define TRUE/FALSE and to declare the functions below */

/* runs the query; the result is handed back to the caller through *htmtable */
int selecthtm(PGconn *conn, char *htmtablename, char *chromname, PGresult **htmtable)
{
    char querystring[256];

    /* write into querystring (the target buffer was missing) and quote the chrom value;
       both names are pasted into the SQL unescaped, so pass trusted input only */
    snprintf(querystring, sizeof querystring,
             "SELECT * FROM %s WHERE chrom = '%s' ORDER BY genstart",
             htmtablename, chromname);
    *htmtable = PQexec(conn, querystring);
    return(validquery(*htmtable, querystring));
}

int is_ridge(PGresult *htmtable, int row, double exprthreshold, int mincount)
/* determines if mincount genes in a row are (part of) a ridge */
/* pre: htmtable is valid and sorted on genStart (ascending) */
/* post: TRUE if the mincount genes starting at row all reach exprthreshold */
{
    if (mincount<=0)
        return TRUE;
    if (row>=PQntuples(htmtable))
        return FALSE;
    /* PQgetvalue returns text: convert it before comparing, and read the current row */
    if (atof(PQgetvalue(htmtable, row, PQfnumber(htmtable, "movmed39expr"))) < exprthreshold) {
        return FALSE;
    }
    return(is_ridge(htmtable, row+1, exprthreshold, mincount-1));
}

int main()
{
    PGconn *conn;            /* holds database connection */
    char querystring[256];   /* query string */
    PGresult *result;

    conn = PQconnectdb("dbname=htm port=6400 user=mroos password=geheim");
    if (PQstatus(conn)==CONNECTION_BAD) {
        fprintf(stderr, "connection to database failed.\n");
        fprintf(stderr, "%s", PQerrorMessage(conn));
        PQfinish(conn);
        exit(1);
    }
    else
        printf("Connection ok\n");

    snprintf(querystring, sizeof querystring, "SELECT * FROM chromosomes");
    printf("%s\n", querystring);
    result = PQexec(conn, querystring);
    if (validquery(result, querystring)) {
        printresults(result);
    }
    else {
        PQclear(result);
        PQfinish(conn);
        return EXIT_FAILURE;   /* was FALSE (0), which the shell reads as success */
    }
    PQclear(result);
    PQfinish(conn);
    return EXIT_SUCCESS;
}

/* prints the first column of at most the first 10 rows */
int printresults(PGresult *tuples)
{
    int i;

    for (i=0; i< PQntuples(tuples) && i < 10; i++) {
        printf("%d, ", i);
        printf("%s\n", PQgetvalue(tuples,i,0));
    }
    return TRUE;
}

/* TRUE if the query returned tuples, FALSE (with a message on stderr) otherwise */
int validquery(PGresult *result, char *querystring)
{
    printf(" in validquery\n");
    if (PQresultStatus(result) != PGRES_TUPLES_OK) {
        fprintf(stderr, "Query %s failed.\n", querystring);
        return FALSE;
    }
    return TRUE;
}

14/09/2009 BioAID

1070 databases
Nucleic Acids Research, Jan 2008 (96 in Jan 2001)
Proteomics, Genomics, Transcriptomics, Protein sequence prediction, Phenotypic studies, Phylogeny, Sequence analysis, Protein Structure prediction, Protein-protein interaction, Metabolomics, Model organism collections, Systems Biology, Epidemiology, etcetera …
All with a splendid interface … all different, of course

Homo biologicus’ bioinformatics
[Diagram labels: Local Database, Local Database]

Homo digitalis in his natural habitat

Workflows

Semantic Web (Linked Open Data)

Homo biologicus
• Lots of data to deal with
• Single tiny brain
• Lots of knowledge to deal with
• Lots of methods and algorithms to try and combine
• No computational superpowers
A needy biologist

Homo biologicus enhancis
• Lots of accessible data
• Knowledge bases to query
• Web Services, Workflows, and their creators available
• Community brain power
• Other people’s computational superpowers
An enhanced biologist

e-Laboratories and e-Laboratory factories

Context: BioAssist Bioinformatics Support

An existing & acknowledged ‘e-Laboratory’

In the e-Laboratory Factory
• Galaxy as front end
• Workflows & Web Services
• Grid enabled Taverna
• MOLGENIS
• Semantic/Concept Web
• myExperiment/BioCatalogue
• Scientific Research Objects

Scientific Research Object 1.0

Anatomy of a Research Object

Research questions on SROs
• How to record & represent scientific collections?
  – OAI-ORE serialised in RDF
  – Life cycle modeled on scientific publication: Draft->Review->Publication->Deprecation
• How do we describe the resources within our Research Object?
  – Dublin Core, SIOC, Research Object Upper Model (ROUM)
• How to capture/represent Trustworthiness?
• How much are scientists willing to share?
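
The life cycle named above (Draft->Review->Publication->Deprecation) can be read as a small state machine. The following is only a sketch in C, under the assumption that a Research Object moves forward one stage at a time and never goes back; the slides do not say whether, for example, a reviewed object may return to Draft. All type and function names are hypothetical and not part of any SRO specification.

```c
#include <assert.h>

/* Hypothetical life-cycle stages, named after the slide. */
typedef enum {
    SRO_DRAFT,
    SRO_REVIEW,
    SRO_PUBLICATION,
    SRO_DEPRECATION
} sro_stage;

/* Human-readable name of a stage. */
static const char *sro_stage_name(sro_stage s)
{
    static const char *names[] = { "Draft", "Review", "Publication", "Deprecation" };
    assert(s >= SRO_DRAFT && s <= SRO_DEPRECATION);
    return names[s];
}

/* Assumption: only a single step forward is allowed.
   Returns 1 if the transition is allowed, 0 otherwise. */
static int sro_can_advance(sro_stage from, sro_stage to)
{
    return from != SRO_DEPRECATION && (int)to == (int)from + 1;
}

/* Moves *stage to the requested stage if the transition is allowed.
   Returns 1 on success; returns 0 and leaves *stage unchanged otherwise. */
static int sro_advance(sro_stage *stage, sro_stage to)
{
    if (!sro_can_advance(*stage, to))
        return 0;
    *stage = to;
    return 1;
}
```

Starting from SRO_DRAFT, a call such as sro_advance(&stage, SRO_REVIEW) succeeds, while a jump straight to SRO_DEPRECATION is refused. A real implementation would probably also record who triggered each transition, which touches on the trustworthiness question in the list above.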

A pilot
• Create an executable Scientific Research Object that holds
  – Galaxy tool models + resources
  – Taverna workflow(s)
  – MOLGENIS data models and UI models
  – Metadata on the experiment
    • Tab delimited, linked to concepts
    • Samples, subjects, observations, etc.
  – Other (references to) datasets
• Execution through Galaxy

SRO = a pack of models
  - Tool models
  - Data/UI models
  - Flow models
  + Attached data
SRO enactment = a running e-laboratory
[Diagram labels: Model SROs; Tools; Flows; Data 2.0; my protocols; my data; mashup data; mashup tools; e-bioinformatician (programmatic interaction); e-biologist (user interfacing)]

e-Galaxy mock-up
[Screenshot mock-up. Panels: Your Scientific Research Object; Running workflow / Underlying workflow; Suggestions by semantic components; Related research and documents (placeholder text). Menu: MOLGENIS, Convert, Import/Export, Research Objects, Store, Configure, Run]

e-Science requirement: Reuse

http://www.epigenius.org/ (mock-up)

E-Lab Vacancies in the Netherlands (blatant advertisement)
http://snipurl.com/elabjobs [OMII-UK (myGrid)/NBIC collaboration]
• Software engineer e-Laboratories
  – Taverna components for Galaxy/Liferay
  – Semantic/Concept Web meta-analysis
• e-biologist (PhD student)
  – Workflow & semantic web
    applied to epigenetics => epiGenius portal
• Grid engineers & post-doc medical applications
  – http://www.vl-e.nl/vlemed/vacancies.html

BioAssist Requirements
• Help bioinformaticians help biologists
• Serve bioinformatics community
  – Analysis pipelines parse large datasets
  – Local/external, small/large databases
  – Data for humans and machines
  – Knowledge for humans and machines

More specifically: Requirements and tools
• Analysis pipelines that parse large datasets
  – Taverna with plugin for Grid access*
  – Taverna platform for e-Labs
• Biological databases (small and large)
  – MOLGENIS
  – REST/SOAP/CSV
  – Galaxy/Taverna
• Data to be exchanged by humans and machines
  – Scientific Research Objects/myExperiment PACKs
• Biological meaning disclosed and linked
  – Concept Web and Semantic Web
  – Semantically enabled Taverna
* Plugin developed by Richard Holland (Eagle Genomics) for SARA and NBIC

MOLGENIS research portal generator
Input: model of my research
Output: auto-generates software files
• Rich user interfaces for biologists
  – Plug in your handwritten scripts (tools, workflows)
• Programming interfaces for bioinformaticians
  – Connect to R statistics
      m <- find.markers()
      544 markers downloaded.
      …
      library(qtl)
      # qtl analysis here
      add.data(qtl, name = "QTLs")
      2,448,000 data elements added.
• Workflow ready webservices
• Rich documentation and UML diagrams
• CSV exchange format
  – strain.txt, species.txt, protocol.txt, probe.txt, marker.txt, investigation.txt, individual.txt, gene.txt, data.txt, constant.properties
• Strongly typed framework
• Data storage optimized for HTP genomics (db, files)
http://www.molgenis.org
Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

A putative scenario
With Galaxy
• Select genome annotation track from UCSC genome browser, load into Galaxy
• Combine with other data resources and local data
• Perform a region selection algorithm
• Collect regions of interest
• Save successful steps
In addition, with ‘e-Galaxy’
• Data disclosed with MOLGENIS
• References to datasets stored in SRO
• Run region selection workflow (show process)
• Run meta-analysis
  – Parse metadata for concepts
  – Run meta-analysis
  – Present additional information
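
The "region selection algorithm" step in this scenario is the kind of computation the C code slide earlier in the deck attempts: finding a ridge, that is, a run of at least mincount consecutive genes whose expression reaches a threshold. Below is a minimal sketch of that idea without the database layer. It assumes the expression values have already been sorted by genomic start position and loaded into a plain array; the function names and the iterative form are illustrative choices, not the author's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the mincount values starting at index row all reach
   threshold, 0 otherwise. A mincount of 0 or less is trivially a ridge,
   mirroring the recursive version on the code slide. */
static int is_ridge_at(const double *expr, size_t n, size_t row,
                       double threshold, int mincount)
{
    assert(expr != NULL || n == 0);
    for (; mincount > 0; --mincount, ++row) {
        if (row >= n)
            return 0;               /* ran off the end of the chromosome */
        if (expr[row] < threshold)
            return 0;               /* this gene breaks the run */
    }
    return 1;
}

/* Returns the index of the first gene that starts a ridge, or -1 if
   there is none. Simple scan; adequate for a sketch, not tuned for speed. */
static long first_ridge(const double *expr, size_t n,
                        double threshold, int mincount)
{
    size_t row;

    for (row = 0; row < n; ++row) {
        if (is_ridge_at(expr, n, row, threshold, mincount))
            return (long)row;
    }
    return -1;
}
```

In a workflow setting, a function like first_ridge would sit behind a Galaxy tool or a Web Service, with the threshold and mincount exposed as parameters so that the successful settings can be saved along with the result.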