Data Discovery Genome and Proteome data integration in RDF

Semantic Web Applications and Tools for Life Sciences
November 2008
[Title figure labels: Transcripts, Data, Proteins, Genome]
Genome and Proteome data integration in RDF
Nadia Anwar, Ela Hunt, Walter Kolch and Andy Pitt
[Title figure labels: Metabolites, Discovery]
Outline
• Data Integration in Bioinformatics.
• Semantic data integration
• Francisella
• Integrating genome annotations with experimental proteomics data in RDF
• Further work
Data Integration is not a solved problem
Information discovery is not Integrated
[Figure: omics data silos, each held in its own LIMS]
• Genomics: high-throughput sequencing, sequence, ORF prediction, genome comparisons
• Gene expression: microarray experiments, computational analysis, transcript profile, transcript abundance, regulatory networks
• Proteomics: proteomics experiments, computational analysis, peptide profiles, peptide abundance, protein identification, protein interactions, PT-modifications
• Metabolomics: metabolic pathways, systems biology, synthetic networks/pathways, predictions, translational medicine
Semantic Data Integration across omes data silos
[Figure labels: Data (Genes, Transcripts, Peptides, Metabolites, Genotype), Information, Discovery]
Proof of concept
Francisella tularensis
[Images: ulceroglandular tularaemia, respiratory tularaemia, oculoglandular tularaemia]
Bioterrorism
• Francisella tularensis is a highly successful intracellular pathogen that causes severe disease (respiratory tularaemia is the most acute form)
• Low infectious dose (10-50 bacteria, compared with anthrax, which requires 8,000-15,000 spores)
• Weaponisation fears
Data sources
Genome (RDF)
Source: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=639633024#export
Subject (RDF:description): (4) IMG:gene_oid=639752258
• (1) RDF:type
• (2) RDFS:comment: TPR
• (3) IMG_S:locus_tag: FTN_0209
• (3) IMG_S:genomic_location_start: 229107
• (3) IMG_S:genomic_location_end: 229976
• (3) IMG_S:genomic_location_strand: +
Data sources
Genome annotations - Francisella SuperFamily Data
Source: http://supfam.cs.bris.ac.uk/
Subject (RDF:description): http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616
• RDF#type: http://purl.uniprot.org/core/Protein_Family
• Model: SUPERFAMILY:cgi-bin/model.cgi?model=0040419
• SUPERFAMILY:Assignment_Region: 155-367
• SUPERFAMILY:Score: 5.1e-39
• SUPERFAMILY:SCOP_ID: SUPERFAMILY:cgi-bin/scop.cgi?sunid=52540
• SUPERFAMILY:SCOP_Fold: P-loop containing nucleoside triphosphate hydrolases
• SUPERFAMILY:Family_ID: 81269
• SUPERFAMILY:Evalue: 7.33e-06
• SUPERFAMILY:Family_Description: Extended AAA-ATPase domain
• SUPERFAMILY:Similar_Structure: 1l8q A:77-289
Data sources
Genome annotations - KEGG
Subject: http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298
• Type: http://img.jgi.doe.gov/schema#gene
• http://img.jgi.doe.gov/schema#gene_name: glpX
• rdfs:comment: fructose
• rdfs:seeAlso: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]
• rdfs:seeAlso: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]
Other nodes shown:
• http://www.genome.jp/dbget-bin/www_bget?pathway+ftn00010
• http://www.genome.jp/dbget-bin/www_bfind?F.tularensis_U112
Genome annotations - NCBI protein
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id=118496616
RDF:type
RDF:description
YP_897666.1
RDF:idsymbol
RDFS:#seeAlso
http://purl.uniprot.org/Annotation/
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?[refseqp-SeqVersion:YP_897666.1]+-e
chromosomal
http://www.ncbi.nlm.nih.gov/sites/gquery?term=Francisella+tularensis+novicida
Data sources
Genome annotations - GO
Subject (RDF:type, RDF:description): http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0277
• mgla:GO_Annotation#ID: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi?view=details&query=0006749
• mgla:GO_Annotation#Term: glutathione
• mgla:GO_Annotation#Ontology: biological_process
• mgla:GO_Annotation#Level: 7
• http://www.compbio.dundee.ac.uk/Software/GOtcha/iscore: 0.879989490261963
• http://www.compbio.dundee.ac.uk/Software/GOtcha/cscore: 5.7273821328517
Poson annotations - Cogs
https://tools.nwrce.org/cgi-bin/fnu112/poson.cgi?poson=PSN082435
mgla:cogNumber
http://www.ncbi.nlm.nih.gov/sites/entrez?db=cdd&cmd=search&term=COG0508
mgla:cogDomain
mgla:cogDescription
mgla:cogCategory
AceF
Pyruvate/2-oxoglutarate
dihydrolipoamide
Data sources - experiments
Transcriptomics
Data sources - experiments
Proteomics
Proteomics: WT vs MglA mutant
Francisella tularensis novicida U112
• WildType: Whole Cell (3) (4), Soluble (3) (4), Membrane (3) (4)
• MglA mutant: Whole Cell (3) (4), Soluble (3) (4), Membrane (3) (4)
• Each fraction analysed with Sequest and DRAGON
• Outputs: identification and relative abundance (P val < 0.01, two-sided t-test)
RDF - excel conversion
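[Diagram: spreadsheet columns (Pval, Pval-1, identified peptide, peptide sequence, abundance) mapped to subject, predicate, object statements]
• analysis mgla:poson PSN
• analysis mgla:experiment abundance
• Identifiers PSN, PSNV2, PSNV3, FTN and DDBID are linked by rdfs:seeAlso
• Onward links to Genome, GO, SP and EC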
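The column-to-predicate idea behind the conversion can be sketched as follows. This is only an illustration under stated assumptions: the spreadsheet is taken to be exported as CSV, and the column names and every cell value below are invented (the slides do not show the real sheet layout). Each row becomes one subject, each remaining column name becomes an mgla: predicate, and each cell becomes a literal.

```python
import csv
import io

MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/"

# Hypothetical CSV export of a single spreadsheet row. Column names and all
# values are invented for illustration; they are not taken from the real data.
SHEET = (
    "analysis,poson,sequence,experiment\n"
    "row1,PSN000000,EXAMPLEPEPTIDEK,2594\n"
)


def literal(value):
    """Quote a cell value as an N-Triples plain literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def rows_to_ntriples(text, key="analysis"):
    """Emit one triple per non-empty cell: subject = key cell, predicate = column name."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = "<" + MGLA + row[key] + ">"
        for column, cell in row.items():
            if column == key or not cell:
                continue
            lines.append(subject + " <" + MGLA + column + "> " + literal(cell) + " .")
    return lines


triples = rows_to_ntriples(SHEET)
print("\n".join(triples))
```

In this flat form every value hangs directly off the row subject, so anything about the measurement itself (for example which replicate an abundance belongs to) is only implicit in the column name.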
Data integration
Reconciled Identifiers
(WashU-B) PSN.V1
(COGs) COGID
(NCBI) PROTEINID
(WashU-B) PSN.V2
(WashU-B) PSN.V3
(IMG) GENEID
(Fn ORF ID) FTN
(Gene Ontology) GOID
(ENZYME) E.C.No
(WashU-P) DDB
(Refseq) ACNo
(Uniprot) ACNo
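A rough sketch of how reconciled identifiers can be walked once they are linked by rdfs:seeAlso. The identifiers below are placeholders, not real mappings, and the traversal (a plain breadth-first search that follows links in the forward direction only) is an assumption about how one might use the links, not the method used in this work.

```python
from collections import defaultdict, deque

SEE_ALSO = "rdfs:seeAlso"

# Placeholder identifiers standing in for PSN.V1, PSN.V2, PSN.V3, FTN, EC and GO
# entries; none of these is a real mapping.
TRIPLES = [
    ("psn.v1", SEE_ALSO, "psn.v2"),
    ("psn.v2", SEE_ALSO, "psn.v3"),
    ("psn.v3", SEE_ALSO, "ftn"),
    ("ftn", SEE_ALSO, "ec"),
    ("ftn", SEE_ALSO, "go"),
]


def reachable(start, triples):
    """Return every identifier reachable from start via rdfs:seeAlso, in BFS order."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == SEE_ALSO:
            graph[subject].append(obj)
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append(neighbour)
    return found


linked = reachable("psn.v1", TRIPLES)
print(linked)
```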
Data Integration
Adding new experiments
[Diagram: Experiments 1 to 4 attach to the shared identifier graph (PSN, PSNV2, PSNV3, FTN, DDBID, linked by rdfs:seeAlso), which connects to public domain data (GO, AC No., EC)]
Data integration
Sesame
NadiaAnwar:~ nadia$ openrdf-sesame-2.1/bin/console.sh
Connected to default data directory
Commands end with '.' at the end of a line
Type 'help.' for help
> connect http://127.0.0.1:8080/openrdf-sesame/.
Disconnecting from default data directory
Connected to http://127.0.0.1:8080/openrdf-sesame/
> show r.
+---------
|SYSTEM ("System configuration repository")
|ftnRepoNative ("Francisella Test")
|FrancisellaNative ("FrancisellaTestStore")
|FrancisellaReified ("Native store with RDF Schema inferencing")
|FrancisellaReified_index2 ("Native store with RDF Schema inferencing")
|Francisella ("Native store with RDF Schema inferencing")
+---------
> open FrancisellaReified_index2.
Opened repository 'FrancisellaReified_index2'
Sesame
Data load (ftnRepoNative) - native (spoc,posc)
Data File                                  time (s)    triples
francisella_locus_tag.nt                       8.93      1,767
interact-prot.nt                              88.51     20,682
interact-prot-peptides.nt                              248,647
mgla search db.fasta.blastp4 ypURL.n3          9.7       1,719
NC_008601.nt                                  43.14     12,781
Ft_novicidaU112go.nt                         359.14      2,548
francisella.rdf2.nt                           43.41     10,434
francisellaSUPERFAMILY.nt                     57.88     16,110
francisellaPROTEIN.fasta.nt                   13.63      5,160
Soluble.nt                                   588.87    336,761
WholeCell.nt                                 469.02    112,625
Membranes.nt                                1003.19    298,771
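The load figures are easier to compare as throughput. The short calculation below takes four rows from the table above (times in seconds, triple counts as listed) and derives triples per second; the choice of rows is arbitrary and no other data is assumed.

```python
# (time in seconds, number of triples) for four rows of the load table.
LOADS = {
    "Soluble.nt": (588.87, 336_761),
    "WholeCell.nt": (469.02, 112_625),
    "Membranes.nt": (1003.19, 298_771),
    "Ft_novicidaU112go.nt": (359.14, 2_548),
}


def triples_per_second(seconds, triples):
    """Load throughput for one file."""
    return triples / seconds


rates = {name: triples_per_second(s, t) for name, (s, t) in LOADS.items()}
slowest = min(rates, key=rates.get)

# Print the files from slowest to fastest load rate.
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:24s} {rate:8.1f} triples/s")
```

Of these four rows, the GO annotation file loads at by far the lowest rate despite being the smallest.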
Data Integration
Mgla data (ftnRepoNative)
[Diagram: RDF graph of the MglA data (analysis, Identified Peptide, Peptide sequence, Experiment, abundance; identifiers PSN, PSNV2, PSNV3, FTN and DDBID linked by rdfs:seeAlso; onward links to GO, SP and EC)]
SELECT psn, ftn, ec FROM
{ftn} rdfs:seeAlso {ec},
{psn} rdfs:seeAlso {ftn},
{analysis} mgla:poson {psn}
WHERE ec LIKE "*[EC:*"
USING NAMESPACE
mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>
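What the SeRQL path expressions compute can be sketched as a chain of joins over an in-memory triple list. This is only an illustration: the analysis and poson identifiers are placeholders, and linking that poson to FTN_0298 is an assumption made for the example. The FTN and EC/SP URLs are the ones shown on the KEGG slide. SeRQL's LIKE with a wildcard on both sides is approximated by a substring test.

```python
SEE_ALSO = "rdfs:seeAlso"
POSON = "mgla:poson"

FTN = "http://www.genome.jp/dbget-bin/www_bget?ftn:FTN_0298"
EC = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[EC:3.1.3.11]"
SP = "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-e+[SP:A0Q4N9_FRATN]"

# "analysis_1" and "psn_example" are placeholders; the psn-to-FTN link is assumed.
TRIPLES = [
    ("analysis_1", POSON, "psn_example"),
    ("psn_example", SEE_ALSO, FTN),
    (FTN, SEE_ALSO, EC),
    (FTN, SEE_ALSO, SP),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def select_psn_ftn_ec(triples):
    """Join analysis -> psn -> ftn -> ec and keep only EC cross-references."""
    rows = []
    for _analysis, predicate, psn in triples:
        if predicate != POSON:
            continue
        for ftn in objects(triples, psn, SEE_ALSO):
            for ec in objects(triples, ftn, SEE_ALSO):
                if "[EC:" in ec:  # stands in for: WHERE ec LIKE "*[EC:*"
                    rows.append((psn, ftn, ec))
    return rows


rows = select_psn_ftn_ec(TRIPLES)
print(rows)
```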
Data Integration
Mgla data (ftnRepoNative)
[Diagram: RDF graph of the MglA data (analysis with rdf:about, mgla:poson, mgla:sequence and mgla:experiment; Identified Peptide, Peptide sequence, abundance; identifiers PSN, PSNV2, PSNV3, FTN and DDBID linked by rdfs:seeAlso; onward links to GO, SP and EC)]
SELECT abundance, psn, ec, ftn FROM
{ftn} rdfs:seeAlso {ec},
{psn} rdfs:seeAlso {ftn},
{analysis} mgla:poson {psn},
{analysis} mgla:experiment {abundance}
WHERE ec LIKE "*[EC:*"
USING NAMESPACE
mgla = <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>
Really easy, But....
• A simple Excel-to-RDF conversion does not enable all queries
• The conversion is not simple: the data needs to be “modelled”
[Diagram, flat conversion: analysis (rdf:about) with mgla:poson PSN, mgla:sequence Peptide sequence, mgla:experiment abundance]
[Diagram, modelled form: { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance]
Data Integration
Reified statements
[Diagram: the analysis data is reified as an rdf:Statement]
• rdf:type: rdf:Statement
• rdf:subject: analysis data
• rdf:predicate: InExperimentReplicate
• rdf:object: Experiment Replicate
• mgla:PeptideAbundance: abundance
[The analysis data also links via mgla:poson to PSN and, through rdfs:seeAlso, to PSNV2, PSNV3, FTN, DDBID, GO, SP and EC; Identified Peptide and Peptide sequence are attached as before]
Sesame
Reified Data load - native-RDFS (spoc,posc,posc)
Data File                          time (s)   time (mins)    triples
FnU112Version3.nt                    383.44        6.3        58,474
PosonMappings.nt                      84.56        1.4        13,760
francisella_locus_tag.nt              16.73        0.3         1,767
ConstructHasGeneID.nt                 23.00        0.4         1,719
interact-prot.nt                     124.95        2.1        20,682
interact-prot-pepteides.nt          1127.97       18.7       248,647
interact-protSeeAlsoisbURL.nt         10.67        0.2         1,528
goAnnotation_URLID.nt                 74.14        1.2        20,501
NC_008601.nt                          75.84        1.3        12,781
Membranes_CogNumberURL.nt              8.60        0.1         2,548
Ft_novicida_U112_go.nt               561.38        9.3         2,548
francisella.rdf2.nt                   46.19        0.8        10,602
francisellaSUPERFAMILY.nt             66.67        1.1        16,110
francisellaPROTEIN.fasta.nt           15.27        0.3         5,160
SolubleReifeid_3.rdf                1392.98       23.2       580,873
WholeCellReified_3.rdf               941.16       15.6       184,221
Membranes_3.rdf                     1026.66       17.111     416,086
fnU112_draftRDFschemaV4.nt        215010.98    3,583.5           501
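The cost of reification shows up in the triple counts of the three experiment files. The calculation below compares the flat counts from the earlier load (Soluble.nt, WholeCell.nt, Membranes.nt) with the counts listed here, assuming that SolubleReifeid_3.rdf, WholeCellReified_3.rdf and Membranes_3.rdf are the reified versions of the same three datasets.

```python
# Triple counts from the two load tables; the flat/reified pairing is an assumption.
FLAT = {"Soluble": 336_761, "WholeCell": 112_625, "Membranes": 298_771}
REIFIED = {"Soluble": 580_873, "WholeCell": 184_221, "Membranes": 416_086}

# Growth factor and absolute number of extra triples per dataset.
growth = {name: REIFIED[name] / FLAT[name] for name in FLAT}
added = {name: REIFIED[name] - FLAT[name] for name in FLAT}

for name in FLAT:
    print(f"{name:10s} x{growth[name]:.2f}  (+{added[name]:,} triples)")
```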
Queries
Which posons have the most highly abundant peptides?
select ftn, psn, exp, abundance from
{psn} rdfs:seeAlso {psnv2},
{psnv2} rdfs:seeAlso {psnv3},
{psnv3} rdfs:seeAlso {ftn},
{analysis} fnu112:poson {psn},
{analysis} rdf:type {rdf:Statement},
{analysis} rdf:object {exp},
{analysis} mgla:PeptideAbundance {abundance}
where xsd:integer(abundance) > 100000
and ftn LIKE "*FTN*"
using namespace
mgla=<http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/>,
fnu112=<http://www.francisella.org/novicida/fnu112/schema/fnu112/experiments/mgla#>
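The xsd:integer(...) cast in the where clause matters because the abundance is stored as a plain string literal. The sketch below mimics the filter on a few invented result rows (all identifiers and values are hypothetical) and contrasts it with a naive string comparison, which orders "2594" after "100000".

```python
# Invented result rows: (ftn, psn, exp, abundance-as-string).
ROWS = [
    ("FTN_a", "PSN_a", "replicate_1", "2594"),
    ("FTN_b", "PSN_b", "replicate_1", "250000"),
    ("other_c", "PSN_c", "replicate_2", "900000"),
]
THRESHOLD = 100000


def highly_abundant(rows, threshold=THRESHOLD):
    """Cast to int before comparing, as xsd:integer(abundance) > threshold does."""
    return [row for row in rows if "FTN" in row[0] and int(row[3]) > threshold]


def naive_string_filter(rows, threshold=THRESHOLD):
    """Compare the literals as strings; this is lexicographic and therefore wrong."""
    return [row for row in rows if "FTN" in row[0] and row[3] > str(threshold)]


hits = highly_abundant(ROWS)
naive = naive_string_filter(ROWS)
print(len(hits), "row(s) with the cast;", len(naive), "row(s) without it")
```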
Queries
which posons have the most highly abundant peptides
Queries
which experiments have the most highly abundant peptides
Reified statements
• Reified MglA data are much bigger (four more statements per abundance value)
• The really interesting queries fail with a Java out-of-memory error (-Xms1024M -Xmx1536M)
• Haven’t yet tested the shortcut path expression
[Diagram: { Peptide Sequence identifiedIn Experiment Replicate } hasAbundance abundance]
{ {reifSubj} reifPred {reifObj} } pred {obj}
{ {seq} identifiedIn {ExpRep} } hasAbundance {abd}
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement>.
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#subject> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/WholeCell_Lvl7_02.1>.
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#predicate> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/InExperimentReplicate>.
<#WholeCell_Lvl7_02.12> <http://www.w3.org/1999/02/22-rdf-syntax-ns#object> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/wildtype/01_wc_01>.
<#WholeCell_Lvl7_02.12> <http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/PeptideAbundance> "2594".
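The five statements above follow a fixed pattern, so they can be generated from four values. A minimal sketch, assuming the mgla namespace is written in its well-formed http:// form; the helper name reify is hypothetical and not part of any library.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MGLA = "http://www.francisella.org/novicida/schema/fnu112/experiments/mgla/"


def reify(statement_id, analysis, replicate, abundance):
    """Return the five N-Triples lines for one reified abundance measurement."""
    node = "<#" + statement_id + ">"
    return [
        f"{node} <{RDF}type> <{RDF}Statement> .",
        f"{node} <{RDF}subject> <{MGLA}{analysis}> .",
        f"{node} <{RDF}predicate> <{MGLA}InExperimentReplicate> .",
        f"{node} <{RDF}object> <{MGLA}{replicate}> .",
        f'{node} <{MGLA}PeptideAbundance> "{abundance}" .',
    ]


# Values taken from the example statements on this slide.
lines = reify("WholeCell_Lvl7_02.12", "WholeCell_Lvl7_02.1", "wildtype/01_wc_01", 2594)
print("\n".join(lines))
```

Four of the five lines exist only to describe the statement being annotated, which is where the extra statements per abundance value come from.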
Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (>20000)
[Venn diagram of the mem and sol fractions; counts shown: 171, 185, 146]
mem MINUS sol:
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
MINUS
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
using namespace
sol INTERSECT mem:
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
INTERSECT
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace
sol MINUS mem:
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*sol*"
MINUS
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) > 20000
and experiment LIKE "*mem*"
using namespace
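The three SeRQL set queries correspond directly to set difference and intersection. The sketch below reproduces that logic on a handful of invented observations (poson identifiers, replicate names and abundances are all hypothetical); matching the fraction by substring stands in for LIKE "*mem*" and LIKE "*sol*".

```python
# Invented (poson, experiment replicate, abundance) observations.
OBSERVATIONS = [
    ("PSN_a", "wildtype/01_mem_01", 25000),
    ("PSN_b", "wildtype/01_mem_01", 30000),
    ("PSN_b", "wildtype/01_sol_01", 41000),
    ("PSN_c", "wildtype/01_sol_01", 22000),
    ("PSN_d", "wildtype/01_sol_01", 1500),
]


def posons(observations, fraction, threshold=20000):
    """Distinct posons seen in a fraction with abundance above the threshold."""
    return {
        psn
        for psn, replicate, abundance in observations
        if fraction in replicate and abundance > threshold
    }


mem = posons(OBSERVATIONS, "mem")
sol = posons(OBSERVATIONS, "sol")

mem_only = mem - sol   # mem MINUS sol
both = sol & mem       # sol INTERSECT mem
sol_only = sol - mem   # sol MINUS mem

print(sorted(mem_only), sorted(both), sorted(sol_only))
```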
Comparison of integrated experimental data
Distinct and overlapping posons identified within each biological fraction (<5000)
[Venn diagram of the mem and sol fractions; counts shown: 219, 245, 125]
mem MINUS sol:
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
MINUS
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
using namespace
sol INTERSECT mem:
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
INTERSECT
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
using namespace
sol MINUS mem:
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*sol*"
MINUS
select distinct psn from
{x} fns:poson {psn},
{x} fn:InExperimentReplicate {experiment},
{analysis} rdf:subject {x},
{analysis} rdf:object {exp},
{analysis} fn:PeptideAbundance {abundance}
where xsd:integer(abundance) < 5000
and experiment LIKE "*mem*"
using namespace
Further work
• Queries are slow in the native repository; database-backed repositories are probably faster.
• Adding a transcriptomic experiment:
WT vs mglA mutant
GEO accession GSE5468
• RDF-S inferencing?
Acknowledgements
• Funding: BBSRC - Radical Solutions for Researching the Proteome
• University of Glasgow, Glasgow
• Prof. Walter Kolch
• Dr Andy Pitt
• University of Strathclyde, Glasgow
• Dr Ela Hunt (Scientific Advisor)
• University of Washington, Seattle
• Prof. Dave Goodlett (Scientific Advisor)
• Dr Mitch Brittnacher, Mathew Radey, Laurence Rohmer
• Dr Tina Guina (MglA experiment)
Abundance thresholds....
• SeRQL aggregate functions would be nice to have
• Queries to find low and high abundance values:
• WHERE abundance BETWEEN MEDIAN(abundance) AND MAX(abundance)
• WHERE abundance BETWEEN MIN(abundance) AND MEDIAN(abundance)
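Until such aggregates are available in the query language, the thresholds can be computed client-side after fetching the abundance values. A small sketch with invented abundances; the two filters mirror the inclusive BETWEEN clauses wished for above.

```python
from statistics import median

# Invented abundance values, as they might come back from a query.
abundances = [2594, 120, 48000, 7300, 150000]

lowest = min(abundances)
middle = median(abundances)
highest = max(abundances)

# BETWEEN is inclusive at both ends, so the median value appears in both halves.
low_half = [a for a in abundances if lowest <= a <= middle]
high_half = [a for a in abundances if middle <= a <= highest]

print("median:", middle)
print("low:", low_half)
print("high:", high_half)
```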