Internet Data, standards and tools for the internet Lena Strömbäck november 2008 2 Tertiary str. PDB Biological data Topics ¾ Part 1: Data and representation ¾ Databases on the web ¾ XML standards ¾ Minimal Information Standards ¾ Part 2: Tools ¾ General Overview ¾ Databases and data mangement ¾ Putting things together: Scientific S f workflows f DNA seq. GenBank … Protein seq. SWISS PROT SWISS-PROT Secondary str. PROSITE Signaling pathway SPAD Taxonomy AmiGO INSULIN november 2008 3 Biological data sources Example of web sources Molecular Biology Database List at Nucleic Acids Research ¾ ¾ ¾ ¾ ¾ ¾ number o of data source es 800 700 600 500 400 Biomodels Reactome IntAct KEGG Uniprot CheBi http://www ebi ac uk/biomodels-main/static-pages http://www.ebi.ac.uk/biomodels main/static pages.do?page=home do?page=home http://www.reactome.org/ http://www.ebi.ac.uk/intact/site/index.jsf http://www.genome.jp/kegg/ http://www.uniprot.org/ http://www.ebi.ac.uk/chebi/ 300 200 100 0 2000 2001 2002 2003 2004 2005 year november 2008 5 november 2008 6 1 How to make use of this data? So … ¾ Availability of web sources within bioinformatics is unique ¾ How can we make best use of them? Common terminology Identify objects Representt R relations november 2008 7 november 2008 How to represent data … 8 SBML ¾ Representation of structure or relations ¾ Web data is often less structured than traditional applications ¾ Defined by: System Biology Development workbench group. ¾ Aim: Standard for information exchange. ¾ Mathematical formulas can be defined,, discrete events. ¾ Semi structured ¾ Integration ¾ XML is the most common language ¾ OWL and RDF are alternatives november 2008 9 november 2008 XML model example <model id="Tyson1991CellModel_6" name="Tyson1991_CellCycle_6var"> <listOfSpecies> + <species id="C2" name="cdc2k" compartment="cell"> + <species id="M" id= M name= name="p p-cyclin_cdc2 cyclin cdc2" compartment= compartment="cell"> cell > + <species id="YP" name="p-cyclin" compartment="cell"> … </listOfSpecies> <listOfReactions> <reaction id="Reaction1" name="cyclin_cdc2k dissociation"> <annotation> <rdf:li rdf:resource="http://www.reactome.org/#REACT_6308"/> <rdf:li rdf:resource="http://www.geneontology.org/#GO:0000079"/> </annotation> <listOfReactants> <speciesReference species species="M"/> M /> </listOfReactants> <listOfProducts> <speciesReference species="C2"/> <speciesReference species="YP"/> </listOfProducts> <kineticLaw> <math xmlns="http://www.w3.org/1998/Math/MathML"> <apply> <times/> <ci> k6 </ci> <ci> M </ci> </apply></math> <listOfParameters> <parameter id="k6" value="1“> </listOfParameters> </kineticLaw> </reaction> + <reaction id="Reaction2" name="cdc2k phosphorylation"> ... more reactions </listOfReactions> </model> </sbml> november 2008 11 <model> <listOfCompartments> <listOfSpecies> <listOfReactions> <reaction> <listOfReactants> <listOfProducts> <listOfModifiers> </reaction> </listOfReactions> </model> 10 PSI MI ¾ Defined by: Proteomics Standards Initiative. ¾ Aim: Standard for representation of molecular interaction. ¾ Thorough description of experiments i and d protein i structure. november 2008 <entry> <experimentList> <interactorList> <interactionList> <interaction> <experimentList> <participList> p p </interaction> </interactionList> </entry> 12 2 Available XML standards …. Name november 2008 13 Ver. Year Purpose Tools Data A computer-readable format for representing models of biochemical reaction networks. A standard for data representation for proteinprotein interaction to facilitate data comparison, exchange and verification. A collaborative effort to create a data exchange format for biological pathway h d data. Support the definition of models of cellular and subcellular processes. Many tools available. Data available from many databases, for instance, KEGG and Reactome. Datasets available from many sources, for instace IntAct, DIP and MINT. Interchange of chemical information over the Internet and other networks networks. More stability and finegrained modelling of nucleotide sequence information. The purpose of INSDSeq is to provide a near-uniform representation for sequence records records. NCBI uses ASN.1 for the storage and retrieval of data such as nucleotide and protein sequences. Data encoded in ASN.1 can be transferred to XML. Facilitate the interchange of data for more efficient communication within the life sciences community. A proteomics-oriented markup language for exchanging proteome data between researchers. To facilitate the exchange of microarray information Molecular browsers, editors. BioCYC. API support in BioJavaX. EMBL. API support in BioJavaX. EMBL, DBJ and GenBank. SRI's BioWarehouse and ProteinStructureFactory's ORFer. Entrez. Labbook's Genomic Browser and Sequence Viewer. Converters. Previously provided by EMBL. SBML 2 2003 Systems Biology Workbench development group. PSI MI 2.5 2005 Proteomics Standards Initiative. BioPAX 2 2005 The BioPAX group. University of Auckland and Physiome Sciences, Inc. Peter Murray-Rust, Henry S. Rzepa. CellML 1.1 2002 CML 2.2 2003 EMBLxml 1.0 2005 EBI. INSDseq 1.4 2005 Seqentry n/a International Nucleotide Sequence Database Collaboration Collaboration. NCBI. n/a What do the standards contain? Defined by BSML 3.1 2002 Labbook.com. HUP-ML 0.8 2003 JHUPO. MAGEML 1.1 2003 MGED. Tools for viewing and analysis. Existing tools for OWL such as Protégé. Datasets available from Reactome. Tools for publication, visualization, creation and simulation. CellML Model Repository (~240 models). ¾ Information about objects: ¾ Proteins/Complexes ¾ Genes/DNA ¾ Other molecules ¾ Interaction information ¾ Information about experiments ¾ Kind of experiment ¾ Evidence of the experiment ¾ More …. HUP-ML Editor. Converters. november 2008 ArrayExpress. BioPAX 14 Summary of standards id name 1 name names dc:title 2 xref cmeta:bio_entity 21 november 2008 17 interactortype organism ncbiTaxId cmeta:species celltype compartment (~ group) tissue cmeta:sex sequence variable initialAmount initialvalue initialConcentration substanceUnits spatialsizeUnits hasonlysubstanceUnits boundaryCondition charge constant SL SL SL SL SL L SL L U L L L L L L U S L S S S SO S S 16 Complex C Components t Organism dcterms:alternative compartment SL SL SL SL SL L SOL SL L S S U Experriments id names SOL SOL SOL S S Type of objects 2 speciesType UL SOL SOL L S CellML:component 1 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 UL SOL SOL L Organnism PSI MI: Interactor UL SOL SOL L Comppartments SBML: Species SBML PSI MI BioPAX CellML CML EMBLxml INSDseq Seqentry BSML HUP-ML MAGE-ML mzXML mzData AGML Pathw ways Representation of objects Interaactions november 2008 Otther 15 Prrotein november 2008 Substances DN NA, RNA ¾ Defined by: The BioPAX working groups. ¾ Aim: Ai G Generall fformatt for f interactions. ¾ Ontology, implemented in OWL ¾ Pathway and molecular interactions. Name units 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Physical Entity Name Shortname Availability Protein DNA complex Gene DNA Organism Sequence Protein protein complex Complex Interactor type Protein Organism Sequence RNA Organism Sequence Smallmolecules Chemical Formula Molecular weight Granularity Ribonucleoprotein complex Small molecule Unknown participant Deoxyrbonucleic acid Biopolymer Nucleic acid Ribonucleic acid Interaction Protein Peptide 21 november 2008 18 3 Representation of interactions CellML: component 1 2 3 4 5 6 7 8 name november 2008 PSI MI: Interaction imexID id xref variable reaction sboTerm variable-ref 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 SBML: Reaction id name role reactant product modifier id name interactiontype experimentList participantList id names experimental-role biological-role participantidentification experimentalpreparation confidencelist sboTerm direction delta_variable stoichiometry stoichiometry kineticLaw inferredInteractionlist participants modelled confidencelist reversible Interaction types reversible fast 1 2 3 4 5 6 7 8 Interaction Name Shortname Availability Evidence Participants Complex Assembly Transport with biochemical reaction Suppression Genetic interaction Synthetic phenotype Interaction type Physical interaction Direct interaction Colocalisation Covalent binding Enzymatic reaction 20 Bioinformatics: Minimal Information Standards. Part 2: Tools ¾ MIRIAM : Minimal Information Requested in the Annotation of biochemical Models ¾ MIAPE: The Minimum Information About a Proteomics Experiment ¾ Many of the standards are supported by software tools: ¾ SBML Tools: htt // b l http://sbml.org/SBML_Software_Guide/SBML_Software_Matrix /SBML S ft G id /SBML S ft M ti MIAPE: GE(Gel Electrophoresis) MIAPE: MS (Mass Spectrometry) MIAPE: CC (Column Chromatography) MIAPE: CE (Capillary Electrophoresis) …. ¾ CellML Tools: http://www.cellml.org/tools ¾ Data management ¾ Scientific workflows ¾ MIMIx: The minimum information required for reporting a molecular interaction experiment …. ¾ …. november 2008 Biochemical reaction Delta-G Delta-H Delta H Delta-S EC-number KEQ Transport november 2008 ¾ ¾ ¾ ¾ ¾ november 2008 Catalysis Cofactor Direction Modulation Physical Interaction Interactiontype Conversion Left Right Spontaneous 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 19 Control Control type Controller Controlled 21 november 2008 22 Data management of XML data XML as a data model ¾ How can the data be efficiently stored and accessed? ¾ Native XML databases ¾ Hybrid solutions ¾ XML provides a data model ¾ The valid XML data structures can be defined by 23 ¾ XML Schema ¾ DTD ¾ XML has its own query languages ¾ XPath Q y ¾ XQuery november 2008 24 4 Expressing Queries in XQuery Storage possibilities for XML Storage g Q Querying i design Loading XML XQuery XML XML results schema Docs Find information on a given protein. Protein id is given. document("rat_small.xml")//proteinInteractor[@id="EBI-77471"] Storage Loading design g XML schema Find the protein information for the proteins that participate in a given interaction. Interaction id is given. g g 25 Querying Mapping layer XML XQuery results Tuples Relational schema SQL Relational results Relational DBMS Native XML DBMS for $ref in document("rat_small.xml")//interaction [ [names/shortLabel="interaction1"] ] /participantList/proteinParticipant/proteinInteractorRef/@ref return document("rat_small.xml")//proteinInteractor[@id=$ref] november 2008 XML Docs Shredding november 2008 XML as a data model 26 How to shred XML? <?xml version="1.0" encoding="UTF-8"?> ¾ XML is richer than the relational model ¾ Tree structure, ¾ Order ¾ … ¾ Vary from highly structured to unstructured ¾ Database export ¾ … ¾ Annotated text documents ¾ Can C contain links to other type off entities november 2008 27 november 2008 Hybrid XML Storage XML data SQL/XQuery XML /Relational results Native Hybrid Hybrid DBMS november 2008 29 Id Pid 0 - Family Id Pid 1 0 Parent Id Pid Name Job 2 1 Lena Lektor Source Ordinal attrName isValue Value Child 0 1 Families False 1 Id Pid Name School 1 1 Family False 2 3 1 Ludvig g Skolan 2 1 Parent False 3 3 1 Name True Lena 3 2 Job True Docent .. 28 New possibilities… XML data XQuery XML results Mapping layer Families <families xmlns:xsi=http://www.w3.org/2001/XMLSchem a-instance> <family> <parent> <name>Lena</name> <job>Lektor</job> </parent> <child> hild <name>Ludvig</name> <school>Skolan</school> </child> </family> </families> <model id="Tyson1991CellModel_6" name="Tyson1991_CellCycle_6var"> <listOfSpecies> + <species id="C2" name="cdc2k" compartment="cell"> + <species i id="M" id "M" name="p-cyclin_cdc2" " li d 2" compartment="cell"> t t " ll" + <species id="YP" name="p-cyclin" compartment="cell"> … </listOfSpecies> <listOfReactions> <reaction id="Reaction1" name="cyclin_cdc2k dissociation"> <annotation> <rdf:li rdf:resource="http://www.reactome.org/#REACT_6308"/> <rdf:li rdf:resource="http://www.geneontology.org/#GO:0000079"/> </annotation> <listOfReactants> <speciesReference species= species="M"/> M /> </listOfReactants> <listOfProducts> <speciesReference species="C2"/> <speciesReference species="YP"/> </listOfProducts> <kineticLaw> <math xmlns="http://www.w3.org/1998/Math/MathML"> <apply> <times/> <ci> k6 </ci> <ci> M </ci> </apply></math> <listOfParameters> <parameter id="k6" value="1“> </listOfParameters> </kineticLaw> </reaction> + <reaction id="Reaction2" name="cdc2k phosphorylation"> ... more reactions </listOfReactions> </model> </sbml> november 2008 30 Species: Id Name C2 cdc2k cell M p-cyclin_cdc2 cell YP …. p-cyclin …. Compartment cell …. Reaction: Id Reaction1 Name cyclin_cdc2k dissociation Annotation <annotation …..> Formula <kinetic_law …..> Reaction2 cdc2k phosphorylation <annotation ….> <kinetic_law ….> …. …. …. …. Reactants: Id Reaction1 Reaction2 …. Species M …. …. Products: Id Reaction1 Reaction1 …. Species S i C2 YP …. 5 SQL and Xpath/XQuery Efficiency: Increasing query complexity select, r.name, s.name from reaction r, products p, species s where r.id = p.id and p.species=s.id; xquery for $y in db2-fn:xmlcolumn('SBML_DATA.SBML_DOC') /model/listOfReactions/reaction/listOfModifiers/modifierSpeciesReference, $z in db2-fn:xmlcolumn('SBML_DATA.SBML_DOC')/model/listOfSpecies/species[@id = $y/@species] return <product> {$y/../../@name} {$z/@name} </product> 4000 Native november 2008 31 Designed shredding 2000 SELECT p.reaction, species.name from species, (SELECT X.* FROM reactome_data, XMLTABLE ('$d/model/listOfReactions/reaction/listOfModifiers/modifierSpeciesReference' passing reactome_doc as "d" COLUMNS product VARCHAR(200) PATH '@species', reaction VARCHAR(200) PATH '../../@id') AS X) p where p.product=species.id Automatic shredding 0 Species november 2008 Efficiency: Combining representations Path Path (2 step) Path (3 step) Path (4 step) 32 Efficiency: Return the result as XML 500 400 4000 XQuery 300 XQuery (rdbms 1) 200 SQL (rdbms1) SQL (designed) SQL (automatic) 2000 Mixed 100 Species Species Species Reaction Reaction Reaction (1) (100) (1000) (1) (500) (10000) 0 3.4 november 2008 33 november 2008 34 So… Tool development: ShreX ¾ Native databases – ¾ Need a tool to speed up the process ¾ Easy to use but sometimes too inefficient ¾ Automatic shredding – XML data Queries Relational schema and Shredding rules DBMS ¾ Can give datamodels that are hard to work with ¾ Manual shredding /Hybrid solutions – XML Schema + Annotations ¾ Requires time consuming design ¾ Extension of an old tool to allow hybrid storage november 2008 35 november 2008 36 6 Demo ShreX november 2008 37 Workflows for data exploration november 2008 Capturing provenance 38 The Vistrails system (Freire et al. University of Utah) ¾ Provenance of scientific artifacts is necessary to reproduce, d validate lid t and d share h scientific i tifi results lt ¾ Provenance can be as important as the results! ¾ Vision: Provenance enable the world ¾ Comprehensive provenance infrastructure for computational tasks ¾ Captures provenance transparently ¾ Provides intuitive query interfaces for exploring provenance data ¾ Supports collaboration ¾ Designed to support exploratory tasks such as visualization and data mining ¾ VisTrails system is open source: www.vistrails.org ¾ >2,000 2 000 downloads d l d since i b beta t release l iin JJan 2007 ¾ 100% Python--runs on Windows, Linux and Mac november 2008 39 november 2008 Demo Vistrails 40 Enhanced functionality ¾ ¾ ¾ ¾ november 2008 41 november 2008 Parameter exploration Query to find workflows of interest Compute difference between workflows C t an analogy Create l b based d on computed t d diff differences 42 7 Questions? I t Interesting ti issues: i ¾Reuse of other peoples efforts ¾ Provenance server ¾ Co-work ¾Bio-specific version of Vistrails Thanks! ¾ Handling SBML - Libsbml ¾ Annotations A i – use off ontologies l i ¾ Easy to use module library ¾ Combining webdata, own results and visualization november 2008 43 november 2008 Working with ShreX: <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:shrex="http://www.cse.ogi.edu/shrex"> <xs:element name="families"> <xs:complexType> xs complexType <xs:sequence maxOccurs="unbounded"> <xs:element name="family" type="familyType"/> </xs:sequence> </xs:complexType> </xs:element> november 2008 Working with ShreX: <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:shrex="http://www.cse.ogi.edu/shrex"> Families Id Pid 0 - <xs:element name="families"> <xs:complexType> xs complexType <xs:sequence maxOccurs="unbounded"> <xs:element name="family" type="familyType"/> </xs:sequence> </xs:complexType> </xs:element> Families_family Id Pid 1 0 <xs:complexType name="familyType"> <xs:sequence> <xs:element name="parent" type="parentType" > <xs:element name="child" type="childType" > </xs:sequence> </xs:complexType> Families_family_parent <xs:complexType name="parentType"> <xs:sequence> <xs:element name= name "name" name type= type "xs:string"/> xs:string /> <xs:element name="job" type="xs:string"/> </xs:sequence> </xs:complexType> Id Id Pid Name Job 2 1 Lena Lektor <xs:complexType name="familyType"> <xs:sequence> <xs:element name="parent" type="parentType" > <xs:element name="child" type="childType" shrex:maptoxml=“true”> </xs:sequence> </xs:complexType> Families_family_child Pid 3 Name 1 School Ludvig g november 2008 Working with ShreX: <xs:element name="families"> <xs:complexType> xs complexType <xs:sequence maxOccurs="unbounded"> <xs:element name="family" type="familyType"/> </xs:sequence> </xs:complexType> </xs:element> <xs:complexType name="familyType"> <xs:sequence> <xs:element name="parent" type="parentType" shrex:tablename=“person”> <xs:element name="child" type="childType" Pid 0 - Families_family Id Pid Child 1 0 <child> <name>Ludvig</name> <school>Skolan/school> </child> Families family parent Families_family_parent Id Pid Name Job 2 1 Lena Lektor <xs:complexType name="childType"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="school" type="xs:string"/> 46 </xs:sequence> <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:shrex="http://www.cse.ogi.edu/shrex"> Families Id Pid 0 - <xs:element name="families"> <xs:complexType> xs complexType <xs:sequence maxOccurs="unbounded"> <xs:element name="family" type="familyType"/> </xs:sequence> </xs:complexType> </xs:element> Families_family Id Pid 1 0 <xs:complexType name="familyType"> <xs:sequence> <xs:element name="parent" type="parentType" shrex:withparent=“true”> <xs:element name="child" type="childType" > </xs:sequence> </xs:complexType> Person Id Pid Name Job 2 1 Lena Lektor 3 1 Ludvig School Skolan Families Id Pid 0 - Families_family Id Pid Name Job 1 0 Lena Lektor Families_family_child Id Pid Name School 3 1 L d i Ludvig Sk l Skolan <xs:complexType name="parentType"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="job" type="xs:string"/> </xs:sequence> </xs:complexType> <xs:complexType name="parentType"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name name="job" job type type="xs:string"/> xs:string /> </xs:sequence> </xs:complexType> november 2008 Id Working with ShreX: shrex:tablename=“person” > </xs:sequence> </xs:complexType> <xs:complexType name="childType"> 47 <xs:sequence> Families <xs:complexType name="parentType"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="job" type="xs:string"/> </xs:sequence> </xs:complexType> Skolan <xs:complexType name="childType"> name childType > <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="school" type="xs:string"/> </xs:sequence> 45 </xs:complexType> <?xml version="1.0" encoding="UTF-8"?> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:shrex="http://www.cse.ogi.edu/shrex"> 44 november 2008 <xs:complexType name="childType"> <xs:sequence> <xs:element name="name" type="xs:string"/> <xs:element name="school" type="xs:string"/> 48 </xs:sequence> 8