Semantic Web Applications and Tools for Life Sciences, November 2008

Genome and Proteome data integration in RDF
Nadia Anwar, Ela Hunt, Walter Kolch and Andy Pitt
[Title figure labels: Transcripts, Proteins, Genome, Metabolites, Data, Discovery]

Outline
• Data integration in bioinformatics
• Semantic data integration
• Francisella
• Integrating genome annotations with experimental proteomics data in RDF
• Further work

Data integration is not a solved problem
Information discovery is not integrated
[Figure: separate omics pipelines, each feeding its own LIMS]
• Genomics (high-throughput sequencing): sequence, ORF prediction, genome comparisons
• Gene expression (microarray experiments, computational analysis): transcript profile, transcript abundance
• Proteomics (proteomics experiments, computational analysis): peptide profiles, peptide abundance, protein identification, protein interactions, PT-modifications
• Metabolomics
• Other labels: Genome, Regulatory Networks, Metabolic Pathways, Systems Biology, Synthetic Networks/Pathways, Predictions, Translational Medicine

Semantic data integration across omes
[Figure labels: data silos; Data; Genes, Transcripts, Peptides, Metabolites; Genotype; Discovery; Information]

Proof of concept: Francisella tularensis
[Images: ulceroglandular tularaemia, respiratory tularaemia, oculoglandular tularaemia]
Bioterrorism
• Francisella tularensis is a very successful intracellular pathogen that causes severe disease (respiratory tularaemia is the most acute form)
• Low infectious dose (10-50 bacteria, compared with anthrax, which requires 8,000-15,000 spores)
• Weaponisation fears

Data sources
Genome RDF
http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=639633024#export
[Figure: RDF graph for gene IMG:gene_oid=639752258 (RDF:description)]
    RDF:type
    RDFS:comment                    TPR
    IMG_S:locus_tag                 FTN_0209
    IMG_S:genomic_location_start    229107
    IMG_S:genomic_location_end      229976
    IMG_S:genomic_location_strand   +

Data sources
Genome annotations
http://supfam.cs.bris.ac.uk/
[Figure: Francisella SuperFamily data, RDF graph for protein http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616 (RDF:description)]
    RDF#type                          http://purl.uniprot.org/core/Protein_Family
    Model                             SUPERFAMILY:cgi-bin/model.cgi?model=0040419
    SUPERFAMILY:Assignment_Region     155-367
    SUPERFAMILY:Score                 5.1e-39
    SUPERFAMILY:SCOP_ID               SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540
    SUPERFAMILY:SCOP_Fold             P-loop containing nucleoside triphosphate hydrolases
    SUPERFAMILY:Family_ID             81269
    SUPERFAMILY:Evalue                7.33e-06
    SUPERFAMILY:Family_Description    Extended AAA-ATPase domain
    SUPERFAMILY:Similar_Structure     1l8q A:77-289

Data sources
Genome annotations - KEGG
[Figure: RDF graph for gene http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298]
    Pathway                                   http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010
    http://img.jgi.doe.gov/schema#gene
    http://img.jgi.doe.gov/schema#gene_name   glpX
    rdfs:comment                              fructose
    rdfs:seeAlso                              http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]
    rdfs:seeAlso                              http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]
    Organism                                  http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112

Genome annotations - NCBI protein
[Figure: RDF graph for protein http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616 (RDF:description)]
    RDF:type          http://purl.uniprot.org/Annotation/
    RDF:idsymbol      YP_897666.1
    RDFS:#seeAlso     http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e
    Other labels: chromosomal; http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida

Data sources
Genome annotations - GO
[Figure: RDF graph for gene http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277 (RDF:type, RDF:description)]
    mgla:GO_Annotation#ID         http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749
    mgla:GO_Annotation#Term       glutathione
    mgla:GO_Annotation#Ontology   biological_process
    mgla:GO_Annotation#Level      7
    http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore   0.879989490261963
    http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore   5.7273821328517

Poson annotations - COGs
[Figure: RDF graph for poson https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435]
    mgla:cogNumber        http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508
    mgla:cogDomain        AceF
    mgla:cogDescription   Pyruvate/2-oxoglutarate dihydrolipoamide
    mgla:cogCategory

Data sources - experiments
Transcriptomics

Data sources - experiments
Proteomics

Proteomics: WT vs MglA mutant
Francisella tularensis novicida U112
[Figure: experimental design]
    Wild type:    Whole Cell (3) (4), Soluble (3) (4), Membrane (3) (4)
    MglA mutant:  Whole Cell (3) (4), Soluble (3) (4), Membrane (3) (4)
    Each fraction is searched with Sequest and DRAGON
    Outputs: identification; relative abundance (P val < 0.01, two-sided t-test)

RDF - Excel conversion
[Figure: spreadsheet columns mapped to subject / predicate / object. Labels: Pval, Pval-1, analysis, Identified Peptide, Peptide sequence, abundance, Genome, mgla:poson, mgla:experiment; identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN, with rdfs:seeAlso links to DDBID, GO, SP and EC]

Data integration
Reconciled identifiers
    Source           Identifier
    WashU-B          PSN.V1
    COGs             COGID
    NCBI             PROTEINID
    WashU-B          PSN.V2
    WashU-B          PSN.V3
    IMG              GENEID
    Fn ORF ID        FTN
    Gene Ontology    GOID
    ENZYME           E.C.No
    WashU-P          DDB
    Refseq           ACNo
    Uniprot          ACNo

Data integration
Adding new experiments
[Figure: Experiments 1 to 4 attach to the identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN, which links through rdfs:seeAlso to public domain data (DDBID, GO, AC No., EC)]

Data integration
Sesame
    NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
    Connected to default data directory
    Commands end with '.' at the end of a line
    Type 'help.' for help
    > connect http://127.0.0.1:8080/openrdf-sesame/.
    Disconnecting from default data directory
    Connected to http://127.0.0.1:8080/openrdf-sesame/
    > show r.
    +----------
    |SYSTEM ("System configuration repository")
    |ftnRepoNative ("Francisella Test")
    |FrancisellaNative ("FrancisellaTestStore")
    |FrancisellaReified ("Native store with RDF Schema inferencing")
    |FrancisellaReified_index2 ("Native store with RDF Schema inferencing")
    |Francisella ("Native store with RDF Schema inferencing")
    +----------
    > open FrancisellaReified_index2.
    Opened repository 'FrancisellaReified_index2'

Sesame
Data load (ftnRepoNative) - native (spoc,posc)
    Data File                                  time (s)    triples
    francisella_locus_tag.nt                       8.93      1,767
    interact-prot.nt                              88.51     20,682
    interact-prot-peptides.nt                                248,647
    mgla search db.fasta.blastp4 ypURL.n3          9.7       1,719
    NC_008601.nt                                  43.14     12,781
    Ft_novicidaU112go.nt                         359.14      2,548
    francisella.rdf2.nt                           43.41     10,434
    francisellaSUPERFAMILY.nt                     57.88     16,110
    francisellaPROTEIN.fasta.nt                   13.63      5,160
    Soluble.nt                                   588.87    336,761
    WholeCell.nt                                 469.02    112,625
    Membranes.nt                                1003.19    298,771

Data integration
MglA data (ftnRepoNative)
[Figure: analysis node with Identified Peptide, Peptide sequence, abundance, mgla:poson and Experiment; identifier chain PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN, with rdfs:seeAlso links to DDBID, GO, SP and EC]
    SELECT psn, ftn, ec
    FROM {ftn} rdfs:seeAlso {ec},
         {psn} rdfs:seeAlso {ftn},
         {analysis} mgla:poson {psn}
    WHERE ec LIKE "*[EC:*"
    USING NAMESPACE
         mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>

Data integration
MglA data (ftnRepoNative)
[Figure: as above, with rdf:about, mgla:sequence and mgla:experiment added on the analysis node]
    SELECT abundance, psn, ec, ftn
    FROM {ftn} rdfs:seeAlso {ec},
         {psn} rdfs:seeAlso {ftn},
         {analysis} mgla:poson {psn},
         {analysis} mgla:experiment {abundance}
    WHERE ec LIKE "*[EC:*"
    USING NAMESPACE
         mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>

Really easy, but...
• A simple Excel-to-RDF conversion does not enable all queries
• Not a simple conversion: the data needs to be "modelled"
[Figure: the flat conversion (analysis node with rdf:about, Identified Peptide, Peptide sequence, abundance, mgla:poson, mgla:sequence, mgla:experiment, PSN) contrasted with the intended model: { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance]

Data integration
Reified statements
[Figure: an analysis data node of rdf:type rdf:Statement, with rdf:subject (Identified Peptide / Peptide sequence), rdf:predicate (InExperimentReplicate) and rdf:object (Experiment Replicate), plus mgla:PeptideAbundance (abundance); the analysis node links through mgla:poson to PSN, and PSN rdfs:seeAlso PSNV2 rdfs:seeAlso PSNV3 rdfs:seeAlso FTN, with rdfs:seeAlso links to DDBID, GO, SP and EC]

Sesame
Reified data load - native-RDFS (spoc,posc,posc)
    Data File                        time (s)    time (mins)    triples
    FnU112Version3.nt                  383.44          6.3       58,474
    PosonMappings.nt                    84.56          1.4       13,760
    francisella_locus_tag.nt            16.73          0.3        1,767
    ConstructHasGeneID.nt               23.00          0.4        1,719
    interact-prot.nt                   124.95          2.1       20,682
    interact-prot-pepteides.nt        1127.97         18.7      248,647
    interact-protSeeAlsoisbURL.nt       10.67          0.2        1,528
    goAnnotation_URLID.nt               74.14          1.2       20,501
    NC_008601.nt                        75.84          1.3       12,781
    Membranes_CogNumberURL.nt            8.60          0.1        2,548
    Ft_novicida_U112_go.nt             561.38          9.3        2,548
    francisella.rdf2.nt                 46.19          0.8       10,602
    francisellaSUPERFAMILY.nt           66.67          1.1       16,110
    francisellaPROTEIN.fasta.nt         15.27          0.3        5,160
    SolubleReifeid_3.rdf              1392.98         23.2      580,873
    WholeCellReified_3.rdf             941.16         15.6      184,221
    Membranes_3.rdf                   1026.66         17.1      416,086
    fnU112_draftRDFschemaV4.nt      215010.98      3,583.5          501

Queries
Which posons have the most highly abundant peptides?
    select ftn, psn, exp, abundance
    from {psn} rdfs:seeAlso {psnv2},
         {psnv2} rdfs:seeAlso {psnv3},
         {psnv3} rdfs:seeAlso {ftn},
         {analysis} fnu112:poson {psn},
         {analysis} rdf:type {rdf:Statement},
         {analysis} rdf:object {exp},
         {analysis} mgla:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 100000 and ftn LIKE "*FTN*"
    using namespace
         mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>,
         fnu112 = <http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>

Queries
Which posons have the most highly abundant peptides?

Queries
Which experiments have the most highly abundant peptides?

Reified statements
• Reified mgla data are much bigger (4 more statements per abundance value)
• The really interesting queries return a Java out-of-memory error (-Xms1024M -Xmx1536M)
• Haven't yet tested the shortcut path expression for reified statements:
    General form:  { {reifSubj} reifPred {reifObj} } pred {obj}
    Here:          { {seq} identifiedIn {ExpRep} } hasAbundance {abd}
    Model:         { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance

Example of one reified abundance value (five statements):
    <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement> .
    <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/WholeCell_Lvl7_02.1> .
    <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/InExperimentReplicate> .
    <#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/wildtype/01_wc_01> .
    <#WholeCell_Lvl7_02.12> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/PeptideAbundance> "2594" .
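The reified pattern above (one rdf:Statement node carrying subject, predicate, object and an abundance literal) can be generated mechanically from a spreadsheet row. The following is only a minimal sketch of that idea, not the conversion script used for the talk: the function name `reify_abundance` and its parameters are hypothetical, and the `http://` form of the mgla namespace is assumed. It mirrors the relative `<#...>` statement identifier shown on the slide; strict N-Triples parsers expect absolute IRIs, so a base IRI may be needed in practice.

```python
# Sketch: emit the five statements for one reified peptide abundance value.
# The namespaces follow the slide; the function and argument names are illustrative only.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/"


def reify_abundance(statement_id, peptide_id, experiment_path, abundance):
    """Return five N-Triples-style lines describing one reified observation."""
    stmt = f"<#{statement_id}>"  # relative identifier, as on the slide
    return [
        f"{stmt} <{RDF}type> <{RDF}Statement> .",
        f"{stmt} <{RDF}subject> <{MGLA}{peptide_id}> .",
        f"{stmt} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{stmt} <{RDF}object> <{MGLA}{experiment_path}> .",
        f'{stmt} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


if __name__ == "__main__":
    # Values taken from the example statement on the slide.
    for line in reify_abundance(
        "WholeCell_Lvl7_02.12", "WholeCell_Lvl7_02.1", "wildtype/01_wc_01", 2594
    ):
        print(line)
```

Each abundance value costs five statements in this layout, which is consistent with the growth in triple counts between the plain and reified loads.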
Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance > 20000)
[Venn diagram: mem and sol; counts shown: 171, 185, 146]

mem MINUS sol:
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
    using namespace [declarations not shown]

sol INTERSECT mem:
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
    INTERSECT
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
    using namespace [declarations not shown]

sol MINUS mem:
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000 and experiment LIKE "*sol*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) > 20000 and experiment LIKE "*mem*"
    using namespace [declarations not shown]

Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (abundance < 5000)
[Venn diagram: mem and sol; counts shown: 219, 245, 125]

mem MINUS sol:
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
    using namespace [declarations not shown]

sol INTERSECT mem:
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
    INTERSECT
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
    using namespace [declarations not shown]

sol MINUS mem:
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000 and experiment LIKE "*sol*"
    MINUS
    select distinct psn
    from {x} fns:poson {psn},
         {x} fn:InExperimentReplicate {experiment},
         {analysis} rdf:subject {x},
         {analysis} rdf:object {exp},
         {analysis} fn:PeptideAbundance {abundance}
    where xsd:integer(abundance) < 5000 and experiment LIKE "*mem*"
    using namespace [declarations not shown]

Further work
• Queries are slow in the native repository; database-backed repositories are probably faster
• Adding a transcriptomic experiment: WT vs mglA mutant (GEO accession GSE5468)
• RDF-S inferencing?

Acknowledgements
• Funding: BBSRC - Radical Solutions for Researching the Proteome
• University of Glasgow, Glasgow
    • Prof. Walter Kolch
    • Dr Andy Pitt
• University of Strathclyde, Glasgow
    • Dr Ela Hunt (Scientific Advisor)
• University of Washington, Seattle
    • Prof. Dave Goodlett (Scientific Advisor)
    • Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
    • Dr Tina Guina (MglA experiment)

Abundance thresholds
• SeRQL aggregate functions would be nice to have
• Queries to find low and high abundance values would then look like:
    • WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
    • WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
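
Until aggregates such as MEDIAN are available in the query language, one possible workaround is to fetch (poson, abundance) rows with an ordinary query and compute the threshold on the client. The sketch below assumes the rows have already been retrieved as string pairs; the function name `split_by_median` and the poson identifiers in the sample data are hypothetical, and only the value 2594 comes from the slides. It is an illustration of the idea, not part of the presented system.

```python
# Sketch: emulate "BETWEEN MIN AND MEDIAN" / "BETWEEN MEDIAN AND MAX" client-side.
from statistics import median


def split_by_median(rows):
    """Partition (poson, abundance) rows into low and high groups around the median.

    Abundances arrive as strings (plain literals), so they are cast to int,
    much like xsd:integer(abundance) in the SeRQL queries. Both bounds are
    inclusive, so a row exactly at the median appears in both groups.
    """
    typed = [(poson, int(abundance)) for poson, abundance in rows]
    mid = median(value for _, value in typed)
    low = [row for row in typed if row[1] <= mid]
    high = [row for row in typed if row[1] >= mid]
    return mid, low, high


if __name__ == "__main__":
    # Hypothetical query results; identifiers are made up for illustration.
    sample = [("PSN000001", "2594"), ("PSN000002", "120000"), ("PSN000003", "18000")]
    mid, low, high = split_by_median(sample)
    print("median:", mid)
    print("low:", low)
    print("high:", high)
```

Computing the threshold outside the store means pulling every abundance value over the wire once, so this is best suited to one-off analyses rather than interactive queries.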