Topics Internet Data, standards and tools for the internet Lena Strömbäck

advertisement
Internet
Data, standards and tools for the internet
Lena Strömbäck
november 2008
2
Tertiary str.
PDB
Biological data
Topics
¾ Part 1: Data and representation
¾ Databases on the web
¾ XML standards
¾ Minimal Information Standards
¾ Part 2: Tools
¾ General Overview
¾ Databases and data mangement
¾ Putting things together: Scientific
S
f workflows
f
DNA seq.
GenBank
…
Protein seq.
SWISS PROT
SWISS-PROT
Secondary str.
PROSITE
Signaling pathway
SPAD
Taxonomy
AmiGO
INSULIN
november 2008
3
Biological data sources
Example of web sources
Molecular Biology Database List at Nucleic Acids Research
¾
¾
¾
¾
¾
¾
number o
of data source
es
800
700
600
500
400
Biomodels
Reactome
IntAct
KEGG
Uniprot
CheBi
http://www ebi ac uk/biomodels-main/static-pages
http://www.ebi.ac.uk/biomodels
main/static pages.do?page=home
do?page=home
http://www.reactome.org/
http://www.ebi.ac.uk/intact/site/index.jsf
http://www.genome.jp/kegg/
http://www.uniprot.org/
http://www.ebi.ac.uk/chebi/
300
200
100
0
2000
2001
2002
2003
2004
2005
year
november 2008
5
november 2008
6
1
How to make use of this data?
So …
¾ Availability of web sources within bioinformatics is unique
¾ How can we make best use of them?
Common
terminology
Identify
objects
Representt
R
relations
november 2008
7
november 2008
How to represent data …
8
SBML
¾ Representation of structure or relations
¾ Web data is often less structured than traditional
applications
¾ Defined by: System Biology
Development workbench
group.
¾ Aim: Standard for
information exchange.
¾ Mathematical formulas can
be defined,, discrete events.
¾ Semi structured
¾ Integration
¾ XML is the most common language
¾ OWL and RDF are alternatives
november 2008
9
november 2008
XML model example
<model id="Tyson1991CellModel_6"
name="Tyson1991_CellCycle_6var">
<listOfSpecies>
+ <species id="C2" name="cdc2k" compartment="cell">
+ <species id="M"
id= M name=
name="p
p-cyclin_cdc2
cyclin cdc2" compartment=
compartment="cell">
cell >
+ <species id="YP" name="p-cyclin" compartment="cell"> …
</listOfSpecies>
<listOfReactions>
<reaction id="Reaction1" name="cyclin_cdc2k dissociation">
<annotation>
<rdf:li
rdf:resource="http://www.reactome.org/#REACT_6308"/>
<rdf:li
rdf:resource="http://www.geneontology.org/#GO:0000079"/>
</annotation>
<listOfReactants>
<speciesReference species
species="M"/>
M />
</listOfReactants>
<listOfProducts>
<speciesReference species="C2"/>
<speciesReference species="YP"/>
</listOfProducts>
<kineticLaw>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<apply> <times/> <ci> k6 </ci> <ci> M </ci> </apply></math>
<listOfParameters> <parameter id="k6" value="1“>
</listOfParameters>
</kineticLaw>
</reaction>
+ <reaction id="Reaction2" name="cdc2k phosphorylation">
... more reactions
</listOfReactions>
</model>
</sbml>
november 2008
11
<model>
<listOfCompartments>
<listOfSpecies>
<listOfReactions>
<reaction>
<listOfReactants>
<listOfProducts>
<listOfModifiers>
</reaction>
</listOfReactions>
</model>
10
PSI MI
¾ Defined by: Proteomics
Standards Initiative.
¾ Aim: Standard for
representation of molecular
interaction.
¾ Thorough description of
experiments
i
and
d protein
i
structure.
november 2008
<entry>
<experimentList>
<interactorList>
<interactionList>
<interaction>
<experimentList>
<participList>
p
p
</interaction>
</interactionList>
</entry>
12
2
Available XML standards ….
Name
november 2008
13
Ver.
Year
Purpose
Tools
Data
A computer-readable
format for representing
models of biochemical
reaction networks.
A standard for data
representation for proteinprotein interaction to
facilitate data comparison,
exchange and verification.
A collaborative effort to
create a data exchange
format for biological
pathway
h
d
data.
Support the definition of
models of cellular and
subcellular processes.
Many tools available.
Data available from
many databases, for
instance, KEGG
and Reactome.
Datasets available
from many sources,
for instace IntAct,
DIP and MINT.
Interchange of chemical
information over the
Internet and other
networks
networks.
More stability and finegrained modelling of
nucleotide sequence
information.
The purpose of INSDSeq is
to provide a near-uniform
representation for sequence
records
records.
NCBI uses ASN.1 for the
storage and retrieval of
data such as nucleotide and
protein sequences. Data
encoded in ASN.1 can be
transferred to XML.
Facilitate the interchange
of data for more efficient
communication within the
life sciences community.
A proteomics-oriented
markup language for
exchanging proteome data
between researchers.
To facilitate the exchange
of microarray information
Molecular browsers,
editors.
BioCYC.
API support in
BioJavaX.
EMBL.
API support in
BioJavaX.
EMBL, DBJ and
GenBank.
SRI's BioWarehouse and
ProteinStructureFactory's
ORFer.
Entrez.
Labbook's Genomic
Browser and Sequence
Viewer. Converters.
Previously
provided by
EMBL.
SBML
2
2003
Systems Biology
Workbench
development group.
PSI MI
2.5
2005
Proteomics
Standards
Initiative.
BioPAX
2
2005
The BioPAX
group.
University of
Auckland and
Physiome Sciences,
Inc.
Peter Murray-Rust,
Henry S. Rzepa.
CellML
1.1
2002
CML
2.2
2003
EMBLxml
1.0
2005
EBI.
INSDseq
1.4
2005
Seqentry
n/a
International
Nucleotide
Sequence Database
Collaboration
Collaboration.
NCBI.
n/a
What do the standards contain?
Defined by
BSML
3.1
2002
Labbook.com.
HUP-ML
0.8
2003
JHUPO.
MAGEML
1.1
2003
MGED.
Tools for viewing and
analysis.
Existing tools for OWL
such as Protégé.
Datasets available
from Reactome.
Tools for publication,
visualization, creation
and simulation.
CellML Model
Repository (~240
models).
¾ Information about objects:
¾ Proteins/Complexes
¾ Genes/DNA
¾ Other molecules
¾ Interaction information
¾ Information about experiments
¾ Kind of experiment
¾ Evidence of the experiment
¾ More ….
HUP-ML Editor.
Converters.
november 2008
ArrayExpress.
BioPAX
14
Summary of standards
id
name
1
name
names
dc:title
2
xref
cmeta:bio_entity
21
november 2008
17
interactortype
organism
ncbiTaxId
cmeta:species
celltype
compartment
(~ group)
tissue
cmeta:sex
sequence
variable
initialAmount
initialvalue
initialConcentration
substanceUnits
spatialsizeUnits
hasonlysubstanceUnits
boundaryCondition
charge
constant
SL
SL
SL
SL
SL
L
SL
L
U
L
L
L
L
L
L
U
S
L
S
S
S
SO
S
S
16
Complex
C
Components
t
Organism
dcterms:alternative
compartment
SL
SL
SL
SL
SL
L
SOL SL
L
S
S
U
Experriments
id
names
SOL
SOL
SOL
S
S
Type of objects
2
speciesType
UL
SOL
SOL
L
S
CellML:component
1
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
UL
SOL
SOL
L
Organnism
PSI MI: Interactor
UL
SOL
SOL
L
Comppartments
SBML: Species
SBML
PSI MI
BioPAX
CellML
CML
EMBLxml
INSDseq
Seqentry
BSML
HUP-ML
MAGE-ML
mzXML
mzData
AGML
Pathw
ways
Representation of objects
Interaactions
november 2008
Otther
15
Prrotein
november 2008
Substances
DN
NA, RNA
¾ Defined by: The BioPAX
working groups.
¾ Aim:
Ai G
Generall fformatt for
f
interactions.
¾ Ontology, implemented in
OWL
¾ Pathway and molecular
interactions.
Name
units
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Physical Entity
Name
Shortname
Availability
Protein DNA complex
Gene
DNA
Organism
Sequence
Protein protein complex
Complex
Interactor type
Protein
Organism
Sequence
RNA
Organism
Sequence
Smallmolecules
Chemical Formula
Molecular weight
Granularity
Ribonucleoprotein complex
Small molecule
Unknown participant
Deoxyrbonucleic acid
Biopolymer
Nucleic acid
Ribonucleic acid
Interaction
Protein
Peptide
21
november 2008
18
3
Representation of interactions
CellML: component
1
2
3
4
5
6
7
8
name
november 2008
PSI MI: Interaction
imexID
id
xref
variable
reaction
sboTerm
variable-ref
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
SBML: Reaction
id
name
role
reactant
product
modifier
id
name
interactiontype
experimentList
participantList
id
names
experimental-role
biological-role
participantidentification
experimentalpreparation
confidencelist
sboTerm
direction
delta_variable
stoichiometry
stoichiometry
kineticLaw
inferredInteractionlist
participants
modelled
confidencelist
reversible
Interaction types
reversible
fast
1
2
3
4
5
6
7
8
Interaction
Name
Shortname
Availability
Evidence
Participants
Complex Assembly
Transport with
biochemical reaction
Suppression
Genetic interaction
Synthetic phenotype
Interaction type
Physical interaction
Direct interaction
Colocalisation
Covalent binding
Enzymatic reaction
20
Bioinformatics:
Minimal Information Standards.
Part 2: Tools
¾ MIRIAM : Minimal Information Requested in the Annotation
of biochemical Models
¾ MIAPE: The Minimum Information About a Proteomics
Experiment
¾ Many of the standards are supported by software
tools:
¾ SBML Tools:
htt // b l
http://sbml.org/SBML_Software_Guide/SBML_Software_Matrix
/SBML S ft
G id /SBML S ft
M ti
MIAPE: GE(Gel Electrophoresis)
MIAPE: MS (Mass Spectrometry)
MIAPE: CC (Column Chromatography)
MIAPE: CE (Capillary Electrophoresis)
….
¾ CellML Tools:
http://www.cellml.org/tools
¾ Data management
¾ Scientific workflows
¾ MIMIx: The minimum information required for reporting a
molecular interaction experiment ….
¾ ….
november 2008
Biochemical reaction
Delta-G
Delta-H
Delta
H
Delta-S
EC-number
KEQ
Transport
november 2008
¾
¾
¾
¾
¾
november 2008
Catalysis
Cofactor
Direction
Modulation
Physical Interaction
Interactiontype
Conversion
Left
Right
Spontaneous
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
19
Control
Control type
Controller
Controlled
21
november 2008
22
Data management of XML data
XML as a data model
¾ How can the data be efficiently stored and
accessed?
¾ Native XML databases
¾ Hybrid solutions
¾ XML provides a data model
¾ The valid XML data structures can be defined by
23
¾ XML Schema
¾ DTD
¾ XML has its own query languages
¾ XPath
Q y
¾ XQuery
november 2008
24
4
Expressing Queries in XQuery
Storage possibilities for XML
Storage
g
Q
Querying
i
design Loading
XML XQuery XML
XML
results
schema Docs
Find information on a given protein. Protein id is given.
document("rat_small.xml")//proteinInteractor[@id="EBI-77471"]
Storage
Loading
design
g
XML
schema
Find the protein information for the proteins that participate in a
given interaction. Interaction id is given.
g
g
25
Querying
Mapping layer
XML
XQuery results
Tuples
Relational
schema
SQL
Relational
results
Relational
DBMS
Native XML
DBMS
for $ref in document("rat_small.xml")//interaction
[
[names/shortLabel="interaction1"]
]
/participantList/proteinParticipant/proteinInteractorRef/@ref
return document("rat_small.xml")//proteinInteractor[@id=$ref]
november 2008
XML
Docs
Shredding
november 2008
XML as a data model
26
How to shred XML?
<?xml version="1.0" encoding="UTF-8"?>
¾ XML is richer than the relational model
¾ Tree structure,
¾ Order
¾ …
¾ Vary from highly structured to unstructured
¾ Database export
¾ …
¾ Annotated text documents
¾ Can
C contain links to other type off entities
november 2008
27
november 2008
Hybrid XML Storage
XML data
SQL/XQuery
XML /Relational
results
Native
Hybrid
Hybrid
DBMS
november 2008
29
Id
Pid
0
-
Family
Id
Pid
1
0
Parent
Id
Pid
Name
Job
2
1
Lena
Lektor
Source
Ordinal
attrName
isValue
Value
Child
0
1
Families
False
1
Id
Pid
Name
School
1
1
Family
False
2
3
1
Ludvig
g
Skolan
2
1
Parent
False
3
3
1
Name
True
Lena
3
2
Job
True
Docent ..
28
New possibilities…
XML data
XQuery
XML results
Mapping
layer
Families
<families
xmlns:xsi=http://www.w3.org/2001/XMLSchem
a-instance>
<family>
<parent>
<name>Lena</name>
<job>Lektor</job>
</parent>
<child>
hild
<name>Ludvig</name>
<school>Skolan</school>
</child>
</family>
</families>
<model id="Tyson1991CellModel_6"
name="Tyson1991_CellCycle_6var">
<listOfSpecies>
+ <species id="C2" name="cdc2k" compartment="cell">
+ <species
i id="M"
id "M" name="p-cyclin_cdc2"
"
li
d 2" compartment="cell">
t
t " ll"
+ <species id="YP" name="p-cyclin" compartment="cell"> …
</listOfSpecies>
<listOfReactions>
<reaction id="Reaction1" name="cyclin_cdc2k dissociation">
<annotation>
<rdf:li
rdf:resource="http://www.reactome.org/#REACT_6308"/>
<rdf:li
rdf:resource="http://www.geneontology.org/#GO:0000079"/>
</annotation>
<listOfReactants>
<speciesReference species=
species="M"/>
M />
</listOfReactants>
<listOfProducts>
<speciesReference species="C2"/>
<speciesReference species="YP"/>
</listOfProducts>
<kineticLaw>
<math xmlns="http://www.w3.org/1998/Math/MathML">
<apply> <times/> <ci> k6 </ci> <ci> M </ci> </apply></math>
<listOfParameters> <parameter id="k6" value="1“>
</listOfParameters>
</kineticLaw>
</reaction>
+ <reaction id="Reaction2" name="cdc2k phosphorylation">
... more reactions
</listOfReactions>
</model>
</sbml>
november 2008
30
Species:
Id
Name
C2
cdc2k
cell
M
p-cyclin_cdc2
cell
YP
….
p-cyclin
….
Compartment
cell
….
Reaction:
Id
Reaction1
Name
cyclin_cdc2k dissociation
Annotation
<annotation …..>
Formula
<kinetic_law …..>
Reaction2
cdc2k phosphorylation
<annotation ….>
<kinetic_law ….>
….
….
….
….
Reactants:
Id
Reaction1
Reaction2
….
Species
M
….
….
Products:
Id
Reaction1
Reaction1
….
Species
S
i
C2
YP
….
5
SQL and Xpath/XQuery
Efficiency:
Increasing query complexity
select, r.name, s.name
from reaction r, products p, species s
where r.id = p.id and p.species=s.id;
xquery
for $y in db2-fn:xmlcolumn('SBML_DATA.SBML_DOC')
/model/listOfReactions/reaction/listOfModifiers/modifierSpeciesReference,
$z in db2-fn:xmlcolumn('SBML_DATA.SBML_DOC')/model/listOfSpecies/species[@id = $y/@species]
return <product> {$y/../../@name} {$z/@name} </product>
4000
Native
november 2008
31
Designed shredding
2000
SELECT p.reaction, species.name
from species,
(SELECT X.*
FROM reactome_data,
XMLTABLE ('$d/model/listOfReactions/reaction/listOfModifiers/modifierSpeciesReference' passing
reactome_doc as "d"
COLUMNS
product VARCHAR(200) PATH '@species',
reaction VARCHAR(200)
PATH '../../@id') AS X) p
where p.product=species.id
Automatic shredding
0
Species
november 2008
Efficiency:
Combining representations
Path
Path (2 step) Path (3 step) Path (4 step)
32
Efficiency:
Return the result as XML
500
400
4000
XQuery
300
XQuery (rdbms 1)
200
SQL (rdbms1)
SQL (designed)
SQL (automatic)
2000
Mixed
100
Species Species Species Reaction Reaction Reaction
(1)
(100) (1000) (1)
(500) (10000)
0
3.4
november 2008
33
november 2008
34
So…
Tool development: ShreX
¾ Native databases –
¾ Need a tool to speed up the process
¾ Easy to use but sometimes too inefficient
¾ Automatic shredding –
XML data
Queries
Relational schema
and
Shredding rules
DBMS
¾ Can give datamodels that are hard to work with
¾ Manual shredding /Hybrid solutions –
XML Schema
+
Annotations
¾ Requires time consuming design
¾ Extension of an old tool to allow hybrid storage
november 2008
35
november 2008
36
6
Demo ShreX
november 2008
37
Workflows for data exploration
november 2008
Capturing provenance
38
The Vistrails system
(Freire et al. University of Utah)
¾ Provenance of scientific artifacts is necessary to
reproduce,
d
validate
lid t and
d share
h
scientific
i tifi results
lt
¾ Provenance can be as important as the results!
¾ Vision: Provenance enable the world
¾ Comprehensive provenance infrastructure for computational
tasks
¾ Captures provenance transparently
¾ Provides intuitive query interfaces for exploring provenance data
¾ Supports collaboration
¾ Designed to support exploratory tasks such as visualization and
data mining
¾ VisTrails system is open source: www.vistrails.org
¾ >2,000
2 000 downloads
d
l d since
i
b
beta
t release
l
iin JJan 2007
¾ 100% Python--runs on Windows, Linux and Mac
november 2008
39
november 2008
Demo Vistrails
40
Enhanced functionality
¾
¾
¾
¾
november 2008
41
november 2008
Parameter exploration
Query to find workflows of interest
Compute difference between workflows
C t an analogy
Create
l
b
based
d on computed
t d diff
differences
42
7
Questions?
I t
Interesting
ti issues:
i
¾Reuse of other peoples efforts
¾ Provenance server
¾ Co-work
¾Bio-specific version of Vistrails
Thanks!
¾ Handling SBML - Libsbml
¾ Annotations
A
i
– use off ontologies
l i
¾ Easy to use module library
¾ Combining webdata, own results and visualization
november 2008
43
november 2008
Working with ShreX:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:shrex="http://www.cse.ogi.edu/shrex">
<xs:element name="families">
<xs:complexType>
xs complexType
<xs:sequence maxOccurs="unbounded">
<xs:element name="family" type="familyType"/>
</xs:sequence>
</xs:complexType>
</xs:element>
november 2008
Working with ShreX:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:shrex="http://www.cse.ogi.edu/shrex">
Families
Id
Pid
0
-
<xs:element name="families">
<xs:complexType>
xs complexType
<xs:sequence maxOccurs="unbounded">
<xs:element name="family" type="familyType"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Families_family
Id
Pid
1
0
<xs:complexType name="familyType">
<xs:sequence>
<xs:element name="parent" type="parentType" >
<xs:element name="child" type="childType" >
</xs:sequence>
</xs:complexType>
Families_family_parent
<xs:complexType name="parentType">
<xs:sequence>
<xs:element name=
name "name"
name type=
type "xs:string"/>
xs:string />
<xs:element name="job" type="xs:string"/>
</xs:sequence>
</xs:complexType>
Id
Id
Pid
Name
Job
2
1
Lena
Lektor
<xs:complexType name="familyType">
<xs:sequence>
<xs:element name="parent" type="parentType" >
<xs:element name="child" type="childType"
shrex:maptoxml=“true”>
</xs:sequence>
</xs:complexType>
Families_family_child
Pid
3
Name
1
School
Ludvig
g
november 2008
Working with ShreX:
<xs:element name="families">
<xs:complexType>
xs complexType
<xs:sequence maxOccurs="unbounded">
<xs:element name="family" type="familyType"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<xs:complexType name="familyType">
<xs:sequence>
<xs:element name="parent" type="parentType"
shrex:tablename=“person”>
<xs:element name="child" type="childType"
Pid
0
-
Families_family
Id
Pid
Child
1
0
<child>
<name>Ludvig</name>
<school>Skolan/school>
</child>
Families family parent
Families_family_parent
Id
Pid
Name
Job
2
1
Lena
Lektor
<xs:complexType name="childType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="school" type="xs:string"/>
46
</xs:sequence>
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:shrex="http://www.cse.ogi.edu/shrex">
Families
Id
Pid
0
-
<xs:element name="families">
<xs:complexType>
xs complexType
<xs:sequence maxOccurs="unbounded">
<xs:element name="family" type="familyType"/>
</xs:sequence>
</xs:complexType>
</xs:element>
Families_family
Id
Pid
1
0
<xs:complexType name="familyType">
<xs:sequence>
<xs:element name="parent" type="parentType"
shrex:withparent=“true”>
<xs:element name="child" type="childType" >
</xs:sequence>
</xs:complexType>
Person
Id
Pid
Name
Job
2
1
Lena
Lektor
3
1
Ludvig
School
Skolan
Families
Id
Pid
0
-
Families_family
Id
Pid
Name
Job
1
0
Lena
Lektor
Families_family_child
Id
Pid
Name
School
3
1
L d i
Ludvig
Sk l
Skolan
<xs:complexType name="parentType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="job" type="xs:string"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="parentType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name
name="job"
job type
type="xs:string"/>
xs:string />
</xs:sequence>
</xs:complexType>
november 2008
Id
Working with ShreX:
shrex:tablename=“person” >
</xs:sequence>
</xs:complexType>
<xs:complexType name="childType">
47
<xs:sequence>
Families
<xs:complexType name="parentType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="job" type="xs:string"/>
</xs:sequence>
</xs:complexType>
Skolan
<xs:complexType name="childType">
name childType >
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="school" type="xs:string"/>
</xs:sequence>
45
</xs:complexType>
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:shrex="http://www.cse.ogi.edu/shrex">
44
november 2008
<xs:complexType name="childType">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="school" type="xs:string"/>
48
</xs:sequence>
8
Download