Exploiting Diverse Sources of Scientific Data the vision, what has been achieved

Exploiting Diverse Sources of
Scientific Data
the vision,
what has been achieved
and what next…
Prof. Jessie Kennedy
e-SI Theme:
Exploiting Diverse Sources of Scientific Data
Science & Scientific Data
Science and Scientific Data are Complex…
[Figure: scientific disciplines (Climatology, Hydrology, Meteorology, Geography, Geology, Ecology, Paleontology, Genomics, Taxonomy, Nomenclature, Proteomics, Morphology, Biochemistry) and the data objects that link them (Temperature, Organism, Taxon concept, Gene sequence, Name, Protein, Pathway)]
Scientific Community: complex
Individual Scientist
Scientific Laboratory
Small Scientific Community
Large Scientific Community
[Figure: the same disciplines and data objects (Climatology, Temperature, Organism, Taxon concept, Gene sequence, Name, Protein, Pathway, ...) repeated at each scale of community]
Science & Scientific Data
Are continually changing
Conclusions become
foundations for new
hypotheses
New experiments invalidate
existing knowledge
Knowledge is open to
interpretation
Different opinions
World continually
changing
[Figure: cycle of hypothesis, experiment, observation, conclusion]
Exploiting Diverse Sources of
Scientific Data: the vision
To provide scientists with technological
solutions to exploit the wealth and diversity of
Scientific Data
Discovery
Access
Sharing
Integration/Linking
Analysis
Which would thereby improve the potential for
new scientific discovery
Projects in most sciences:
ESG
SEEK (Scientific Environment for
Ecological Knowledge): Vision
• Research, develop, and capitalize upon
advances in information technology to
radically improve the type and scale of
ecological science that can be
addressed
– Scalable synthesis
Michener
Data Dispersion Challenges
• Data are massively dispersed
– Ecological field stations and research centers (100’s)
– Natural history museums and biocollection facilities (100’s)
– Agency data collections (10’s to 100’s)
– Individual scientists (1000’s)
– Maintenance must be local
Michener
Data Integration Challenges
• Data are heterogeneous
– Syntax
• (format)
– Schema
• (model)
– Semantics
• (meaning)
Jones
Ecological Modeling Challenges
• Analysis and modeling tools are:
– Specialized
– Disconnected
– Proprietary
• It is:
– Difficult to revise analyses
– Hard to document analyses
– Impossible to reliably publish models to share with colleagues
– Hard to re-use models and analyses from colleagues
– Difficult to use grid-computing for demanding computations
– Labor-intensive to manage data in popular analysis software
Michener
Exploiting Diverse Sources of
Scientific Data: the approaches
Data Discovery/Access
Metadata
To describe the data sets
Ontologies
To define the terminology used
Standardisation of formats
For the exchange of data
Life Science Identifiers (LSIDs)
To uniquely identify and resolve data objects
Provenance of data
To record where the data has come from
And what has happened to it en route.
GRID/Web technology
Distributed data management
Exploiting Diverse Sources of
Scientific Data: the approaches
Data Integration/Linking
Metadata
To know how to interpret the data sets
Ontologies
To know how data in the data sets might be related
To aid automatic transformation of the data
Standardisation of formats
To ease integration
Life Science Identifiers (LSIDs)
To know when 2 things are the same
Workflows
To enable refinement and repetition of integration
Exploiting Diverse Sources of
Scientific Data: the approaches
Data Analysis
Metadata
To know how to interpret the data sets
Ontologies
To know analytical/transformation processes appropriate
Workflow Tools
To ease analytical processes
Recording/reuse of analytical processes
Provenance
Recording life history of data
To enable validation
Exploiting Diverse Sources of
Scientific Data: the technologies
Standardisation of formats
Metadata
Ontologies
Life Science Identifiers (LSIDs)
Provenance
Workflow Tools
GRID/Web technology
Meta Data: the vision
Meta data - "data about data"
keywords, title, creator ….
If scientists marked up their data with the
agreed meta data it would be trivial to find
highly relevant data (sub-)sets for analysis…
Meta-utopia….
Meta-utopia
A world of complete, reliable metadata.
In meta-utopia,
Everyone uses the same language
and means the same thing…
The guardians of epistemology have rationally
mapped out a schema or hierarchy of ideas
that everyone adheres to…
Scientists accurately describe their methods,
processes and results
so anyone can do anything with them in the future…
Cory Doctorow
Meta Data: the approach
Common language
XML Schemas to describe data/meta data
Domain specific exchange schemas
Explosion of these in every domain
Exchanging data
Archiving data
Ecological Metadata Language
A look inside the meta-utopia of
ecology
knb.ecoinformatics.org
Identification: dataset elements
Identification: resource elements
Identification: party elements
Discovery: coverage elements
– Geographic
– Temporal
– Taxonomic
Evaluation Level Information
Evaluation: Method Information
Evaluation: Project Information
Access: Permissions Information
Access: Physical Information
Access: Physical formatting details
Access: Distribution Information
Integration Level Information
Integration Level: Attribute structure
Integration Level: attribute domains
Integration Level: measurementScale
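To make the discovery and integration levels concrete, here is a minimal sketch of a metadata record and two queries over it. The element names loosely follow EML (coverage, taxonomicCoverage, attributeList, measurementScale), but the record is not schema-valid EML, and the helper names, dataset title, station name and attribute names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def make_dataset(title, creator, taxon, attributes):
    """Build a small EML-like <dataset> element (illustrative, not schema-valid)."""
    dataset = ET.Element("dataset")
    ET.SubElement(dataset, "title").text = title
    ET.SubElement(ET.SubElement(dataset, "creator"), "organizationName").text = creator

    # Discovery level: what the data set covers.
    coverage = ET.SubElement(dataset, "coverage")
    taxonomic = ET.SubElement(coverage, "taxonomicCoverage")
    classification = ET.SubElement(taxonomic, "taxonomicClassification")
    ET.SubElement(classification, "taxonRankValue").text = taxon

    # Integration level: what each attribute (column) means.
    attribute_list = ET.SubElement(dataset, "attributeList")
    for name, scale in attributes:
        attribute = ET.SubElement(attribute_list, "attribute")
        ET.SubElement(attribute, "attributeName").text = name
        ET.SubElement(ET.SubElement(attribute, "measurementScale"), scale)
    return dataset


def covers_taxon(dataset, taxon):
    """Discovery query: does the taxonomic coverage mention this taxon name?"""
    path = "./coverage/taxonomicCoverage/taxonomicClassification/taxonRankValue"
    return any(value.text == taxon for value in dataset.findall(path))


def attribute_scales(dataset):
    """Integration query: map each attribute name to its measurement scale."""
    return {
        attribute.findtext("attributeName"): attribute.find("measurementScale")[0].tag
        for attribute in dataset.findall("./attributeList/attribute")
    }


record = make_dataset(
    "Hypothetical spruce survey",
    "Example Field Station",
    "Picea rubens",
    [("height_m", "ratio"), ("plot_id", "nominal")],
)
```

Note that the discovery query matches on the taxon name as a plain string, which is exactly the weakness the later slides on names and concepts address.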
Meta Data: the approach
Common language
XML Schemas to describe data/meta data
Domain specific exchange schemas
Explosion of these in every domain
Exchanging data
Archiving data
Turned into extensive specifications
Difficult to know where to stop…
but even this wasn’t enough…..
It’s not good enough to have meta-data, we
need to know what the terms in the meta-data
(schema or data values) mean.
Ontologies – the vision
If we understood the meaning of the schema
and the terms used in the meta-data or
databases we would be able to:
find things more reliably,
integrate things more easily,
reason about what things are comparable….
because we have support for automatic inference
Ontologies – the approach
Common Language…
OWL?
RDF, OWL lite, OWL DL, OWL full…..
Domain specific ontologies
or project specific?
Map different ontologies
Modularise the ontologies
Reuse..
Build upper ontologies to which domain
ontologies extend/link
Biodiversity Base Ontology
Core Layer
BDI Core Taxon Name
BDI Core Taxon Concept
BDI Core BioSpecimen
BDI Core BioObservation
Similar to…
SEEK Observation ontology
Josh Madin
entity
An extension point for domain-specific terms
Josh Madin
Characteristic
Josh Madin
Measurement standard
Similar to…
All the units, scales, indices, classifications, and lists used
for ‘measuring’ a characteristic
Josh Madin
Semantic Web for Earth and Environmental
Terminology (SWEET)
Ontologies revised and validated Jan 26, 2006
Earth Realm
Physical Phenomena
Physical Process
Physical Property
Physical Substance
Sun Realm
Biosphere
Data
Data Center
Human Activity
Material Thing
Numerics
Sensor
Space
Time
Units
BDI Taxon Concept Ontology
…is really just a schema for representing…
Takes us back to…
Biological Taxonomy
Classify and name all organisms in the world
So we can talk about them, experiment with them
Do life science…
The longest running attempt at building an ontology?
Linnaeus binomial system of nomenclature started in 1758
An attempt to resolve a long standing problem in biology
Many ways to classify things
Understanding continually changes with new discoveries &
technologies
Classifications continually being redone
New things defined, New definitions given for things in existence
Lots of classifications over time
Many compete at any one point in time
Taxonomic history of imaginary genus Aus L. 1758
5 Revisions of Aus
Linnaeus 1758: Aus L.1758 containing Aus aus L.1758
Archer 1965: Aus L.1758 containing Aus aus L.1758, Aus bea Archer 1965
Fry 1989: Aus L.1758 containing Aus aus L.1758, Aus bea Archer 1965, Aus cea Fry 1989
Tucker 1991: Aus L.1758 containing Aus aus L.1758, Aus cea Fry 1989
Pargiter 2003: Aus L.1758 containing Aus aus L.1758, Aus ceus Fry 1989; Xus Pargiter 2003 containing Xus beus (Archer) Pargiter 2003
1 name spelling change
Pyle 1990: Aus bea and Aus cea noted as invalid names and replaced with Aus beus and Aus ceus.
• 8 Names
• 2 genera
• 6 species
Results in many concepts for each name
N0 - Aus L.1758
  C0.1 - Aus L.1758 sec. Linnaeus 1758
  C0.2 - Aus L.1758 sec. Archer 1965
  C0.3 - Aus L.1758 sec. Fry 1989
  C0.4 - Aus L.1758 sec. Tucker 1991
  C0.5 - Aus L.1758 sec. Pargiter 2003
N1 - Aus aus L.1758
  C1.1 - Aus aus L.1758 sec. Linnaeus 1758
  C1.2 - Aus aus L.1758 sec. Archer 1965
  C1.3 - Aus aus L.1758 sec. Fry 1989
  C1.4 - Aus aus L.1758 sec. Tucker 1991
  C1.5 - Aus aus L.1758 sec. Pargiter 2003
N2 - Aus bea Archer 1965
  C2.2 - Aus bea Archer 1965 sec. Archer 1965
  C2.3 - Aus bea Archer 1965 sec. Fry 1989
N3 - Aus cea Fry 1989
  C3.3 - Aus cea Fry 1989 sec. Fry 1989
  C3.4 - Aus cea Fry 1989 sec. Tucker 1991
N4 - Aus beus Archer 1965
N5 - Aus ceus Fry 1989
  C5.5 - Aus ceus Fry 1989 sec. Fry 1989
N6 - Xus beus Pargiter 2003
  C6.5 - Xus beus Pargiter 2003 sec. Pargiter 2003
N7 - Xus Pargiter 2003
  C7.5 - Xus Pargiter 2003 sec. Pargiter 2003
8 Names
17 Concepts
Possible interpretations of Aus aus L. 1758
N1 - Aus aus L.1758
• Request data sets about Aus aus (N1)
• What’s returned?
  – Original concept: C1.1
  – Most recent concept: C1.5
  – Preferred Authority (e.g. Fry 1989): C1.3
  – Everything ever named N1: Union(C1.1,C1.2,C1.3,C1.4,C1.5)
  – Best fit according to some matching algorithm: Best(C1.1,C1.2,C1.3,C1.4,C1.5)
  – New concept containing only those features common to all concepts with the name N1: Intersection(C1.1,C1.2,C1.3,C1.4,C1.5)
• Is it appropriate to link or merge data on this?
  – Depends on the user’s purpose
  – Level of precision required
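The interpretations of a request for N1 can be sketched as alternative resolution strategies. This is a minimal sketch, assuming each concept can be modelled as the set of specimens it circumscribes; the specimen identifiers (s1 to s5) and the memberships are invented for illustration and do not come from the slides. Best(...) is left out because the matching algorithm is not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    concept_id: str      # e.g. "C1.3"
    name_id: str         # e.g. "N1" (Aus aus L.1758)
    authority: str       # the "sec." author
    year: int
    members: frozenset   # hypothetical specimen ids circumscribed by the concept


# Hypothetical circumscriptions of the five concepts that carry the name N1.
CONCEPTS = [
    Concept("C1.1", "N1", "Linnaeus", 1758, frozenset({"s1", "s2", "s3", "s4"})),
    Concept("C1.2", "N1", "Archer", 1965, frozenset({"s1", "s2", "s3"})),
    Concept("C1.3", "N1", "Fry", 1989, frozenset({"s1", "s2"})),
    Concept("C1.4", "N1", "Tucker", 1991, frozenset({"s1", "s2", "s5"})),
    Concept("C1.5", "N1", "Pargiter", 2003, frozenset({"s1", "s2", "s5"})),
]


def concepts_for(name_id):
    """All concepts carrying a name, oldest first."""
    return sorted((c for c in CONCEPTS if c.name_id == name_id), key=lambda c: c.year)


def original(name_id):
    return concepts_for(name_id)[0].members


def most_recent(name_id):
    return concepts_for(name_id)[-1].members


def preferred(name_id, authority):
    for concept in concepts_for(name_id):
        if concept.authority == authority:
            return concept.members
    raise KeyError(f"no concept of {name_id} according to {authority}")


def union(name_id):
    """Everything ever placed under the name."""
    return frozenset().union(*(c.members for c in concepts_for(name_id)))


def intersection(name_id):
    """Only what every concept with this name agrees on."""
    return frozenset.intersection(*(c.members for c in concepts_for(name_id)))
```

Each strategy returns a different set for the same name, which is why the choice has to depend on the user's purpose and the precision required.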
Classifications: synonymy relationships between concepts and names.
Parent child relationships in 5 revisions
Names for each of the concepts
In the literature taxonomists tell us names that are synonymous with their concepts
[Figure: concepts C0.1 to C7.5 arranged by revision, each linked to its name (N0 to N7) and to the names cited as synonyms]
Which can result in anything being returned for Aus aus by traversing the synonymy links
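Why traversing synonymy links can return almost anything is easy to see in a small graph search. This is a minimal sketch: the concept and name ids are those used in the slides, but the synonymy assertions themselves (C1.4 citing N2, C6.5 citing N2) are hypothetical, since the actual links are only drawn in the figure.

```python
from collections import deque

# For each concept: its own name plus any names its author cites as synonyms.
# The extra names on C1.4 and C6.5 are hypothetical synonymy assertions.
LINKS = {
    "C1.1": {"N1"},
    "C1.2": {"N1"},
    "C1.3": {"N1"},
    "C1.4": {"N1", "N2"},
    "C1.5": {"N1"},
    "C2.2": {"N2"},
    "C2.3": {"N2"},
    "C3.3": {"N3"},
    "C6.5": {"N6", "N2"},
}


def reachable_concepts(start_name, links):
    """Breadth-first search over name <-> concept links starting from a name."""
    name_to_concepts = {}
    for concept, names in links.items():
        for name in names:
            name_to_concepts.setdefault(name, set()).add(concept)

    seen_names = {start_name}
    seen_concepts = set()
    queue = deque([start_name])
    while queue:
        name = queue.popleft()
        for concept in name_to_concepts.get(name, ()):
            if concept in seen_concepts:
                continue
            seen_concepts.add(concept)
            for other in links[concept]:
                if other not in seen_names:
                    seen_names.add(other)
                    queue.append(other)
    return seen_concepts


hits = reachable_concepts("N1", LINKS)
```

With these assumed links, a query for Aus aus (N1) also reaches the Aus bea concepts and the Xus beus concept, because one shared synonym is enough to join otherwise separate groups.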
Classifications with set relationships between concepts.
What we need are the set relationships from concepts in a revision to earlier concepts
and name changes related to earlier names
We can build systems to return data fit for purpose
[Figure: the same concepts and names, with set relationships drawn between concepts of successive revisions and "=" links between changed names]
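With explicit set relationships, a system can decide which concepts' data are safe to return for a given request. This is a minimal sketch under stated assumptions: the relationship vocabulary (is_congruent_to, is_included_in, includes, overlaps, excludes) and every relationship listed are hypothetical, since the symbols in the figure did not survive.

```python
# (a, b): relation means "a <relation> b". All entries are hypothetical.
RELATIONS = {
    ("C1.2", "C1.1"): "is_included_in",
    ("C1.3", "C1.2"): "is_included_in",
    ("C1.4", "C1.3"): "includes",
    ("C1.5", "C1.4"): "is_congruent_to",
    ("C2.2", "C1.1"): "is_included_in",
}

INVERSE = {
    "is_congruent_to": "is_congruent_to",
    "is_included_in": "includes",
    "includes": "is_included_in",
    "overlaps": "overlaps",
    "excludes": "excludes",
}


def relation_to(source, target, relations):
    """Relation of source to target, using the inverse if only (target, source) is recorded."""
    if (source, target) in relations:
        return relations[(source, target)]
    if (target, source) in relations:
        return INVERSE[relations[(target, source)]]
    return None


def usable_concepts(target, relations, strict=True):
    """Concepts whose data may be returned for a request about `target`.

    strict=True keeps only concepts equal to or wholly inside the target.
    strict=False also accepts broader or overlapping concepts, trading
    precision for recall. Only directly recorded relations are used.
    """
    allowed = {"is_congruent_to", "is_included_in"}
    if not strict:
        allowed |= {"includes", "overlaps"}
    candidates = {concept for pair in relations for concept in pair}
    result = {target}
    for concept in candidates:
        if concept != target and relation_to(concept, target, relations) in allowed:
            result.add(concept)
    return result
```

The strict flag is one way to express the "level of precision required": the same request can return a narrow, safe answer or a broader one, depending on the user's purpose.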
Real Taxonomic Revisions
German mosses
14 classifications in 73 years
covering 1548 taxa
only 35% thought to be stable concepts
65% of names used in legacy data sets are ambiguous
and we don’t know which ones??
we need computers to help understand this…
Smaller classifications are combined into large
classifications
ITIS – integrated taxonomy (also changing) approx. 250,000
taxa
Taxonomic Revision of genus Alteromonas
34 years: from 1972 to 2006
Thanks to George Garrity, Michigan State Univ.
Revision years: 1972 1973 1976 1977 1978 1979 1981 1982 1984 1986 1987 1988 1990 1992 1995 1997 2000 2001 2002 2004 2005 2006
[Figure: animated build, one frame per revision year; each frame repeats the previous lists and adds newly described or reassigned species. The first and last frames are listed below.]
1972
Alteromonas: macleodii(T), communis, vaga
2006
Oceanospirillum: linum(T), japonicum, minutium, biejerinckii, maris, maris, maris, williamsae, hiroshimense, multiglobiferum, pelagicum, pusillum, commune, jannaschii, kreigii, vagum, biejerinckii, pelagicum, maris, hiroshimense
Marinomonas: communis(T), vaga, mediterannea, primoryensis
Alteromonas: macleodii(T), communis, vaga, haloplanktis, rubra, citrea, esperjiana, undina, aurantia, putrifaciens, hanedai, luteoviolaceae, denitrificans, colwelliana, tetradonis, atlantica, carageenovora, distincta, fuliginea, elyakoviii, stellipolaris, litorea, 2 others
Shewanella: putrifaciens(T), benthica, hanedai, colwelliana, algae, fridgidimarina, geldimarina, woodyii, amazonensis, baltica, oneidensis, pealeana, violacea, japonica, denitrificans, livingstonensis, alleyanna, mariniintestina, saire, schlegeliana, gaetbuli, 13 others
Pseudoalteromonas: haloplanktis, haloplanktis(T), haloplanktis, tetradonis, atlantica, aurantia, carrageenovora, citrea, esperjiana, luteoviolacea, nigrifaciens, pisicida, rubra, undina, antartica, bacteriolytica, prydzensis, tunicata, distincta, elyakovii, peptidolytica, tetrodonis, 14 others
May 2004
Gammaproteobacteria
Alteromonadales
Alteromonadaceae: Alteromonas, Aestuariibacter, Alishewanella, Colwellia, Ferrimonas, Glaciecola, Idiomarina, Marinobacter, Marinobacterium, Microbulbifer, Moritella, Pseudoalteromonas, Psychromonas, Shewanella, Thalassomonas
Incertae sedis: Teredinibacter
November 2004
Colwelliaceae: Thalassomonas
Ferrimonadaceae: Ferrimonas
Pseudoalteromonadaceae: Pseudoalteromonas, Algicola
Idiomarinaceae: Idiomarina
Psychromonadaceae: Psychromonas
Shewanellaceae: Shewanella
Moritellaceae: Moritella
Incertae sedis: Agarivorans, Alishewanella, Marinobacter, Marinobacterium, Microbulbifer, Salinomonas, Teredinibacter
At the species level
18 “emendations”
21 new species
19 species reassigned to 4 genera
3 new combinations
6 synonyms
2 species to subspecies
2 subspecies to species
50 names, five genera, five families, and two classes but….
only 5 validly published species.
At the higher level
1 Family 16 genera -> 8 families 12 genera
1 unclassified genus -> 7 unclassified genera
Which is correct?
Which is supported/recorded in the data?
What is the impact on Analysis?
Meta-utopia - a pipe dream?
What is meta-data?
Your meta data is my data…
Depends on your perspective
How you see the world
What’s important to you
What you want to do with the “data”
But it’s useful to differentiate for certain purposes
It’s all data anyway…..
[Figure: an Ecological Data set and Taxonomic Data, each labelled META DATA or DATA depending on perspective]
Meta data
Name: Linnaeus
Year: 1758
Data
Taxon          Higher Taxon
Picea          Pinaceae
Picea abies    Picea
Picea rubens   Picea
Meta-utopia - a pipe dream?
Schemas aren't neutral
Presumes there is a "correct" way of modelling or
categorising ideas
that, given enough time and incentive, people can agree
on the correct way…
Any hierarchy of concepts necessarily implies the
importance of some axes over others.
Geographic/cartographic perspective
Instance of Picea rubens is-a feature that can be mapped
Features inherently have geospatial coordinates.
Taxonomic perspective
Instance of Picea rubens is a specimen of some biological taxon
Taxa inherently have characteristics used in classification
[Figure: two hierarchies that both contain Picea rubens, one of features (Feature, Building, Observation, Organism occurrence) and one of taxa (Pinaceae, Picea, Picea abies)]
Meta-utopia - a pipe dream?
There's more than one way to describe
something
Reasonable people can disagree forever on how to
describe something.
Requiring scientists to use the same vocabulary to
describe their data enforces homogeneity in ideas.
Which could limit science…
Meta-utopia - a pipe dream?
Metrics influence results
Agreeing to a common metric for measuring
important things in a domain necessarily privileges
the items that score high on that metric, regardless
of those items' overall suitability.
Ranking axes are mutually exclusive
e.g. software that scores high for security scores low for convenience
Everyone wants to emphasize their high-scoring
axes
and de-emphasize (or, if possible, ignore altogether) their
low-scoring axes.
Meta-utopia - a pipe dream?
People are not altruistic
Scientists have their own immediate deliverables
Doesn’t leave time for thinking about who else might do what with
their data
Metadata exists in a competitive world.
People want their work cited and will (ab)use meta-data to do so.
People are busy
e-Scientists understand the importance of excellent
metadata
Jo-scientist is mainly concerned about publishing the results.
No time for added extras
Meta-utopia - a pipe dream?
People make mistakes
Even when there's a positive benefit to creating
good metadata, people don’t exercise enough care
and diligence in their metadata creation.
Mission Impossible?
Simple observation demonstrates people are poor
observers of their own behaviours.
Therefore any meta data will be a poor representation
Life Science Identifiers (LSIDs):
the vision
• WWW provides a globally distributed communication framework
• LSID and the LSID Resolution System
  – will provide a simple mechanism to globally resolve locally named objects distributed over the WWW.
• LSIDs will allow us to know
  – what kind of object it is,
  – who originated it,
  – who is responsible for it,
  – how to interface to it and
  – what computations might be carried out on it.
• Adoption of LSIDs
  – will facilitate more reliable integration of multiple knowledge bases, each of which has partial information of a shared domain
  – will encourage stronger global collaboration in life sciences.
Clark T., Martin S., Liefeld T. Globally Distributed Object Identification for Biological Knowledgebases. Briefings in Bioinformatics 5.1:59-70, March 1, 2004.
Life Science Identifiers
URI based naming scheme
urn:lsid:ipni.org:names:1234-1
An LSID has data
- gene sequence in GenBank
- ecological data set (in excel, or in a text file)
- image
- anything you want
The data should never change
- can version
An LSID has metadata
- format of the data
- display title for clients
- Dublin core metadata
The metadata can change
Retrieval framework
LSID resolver: Get data returns the data record; Get metadata returns RDF
http://lsid.sourceforge.net/
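To make the naming scheme concrete, here is a minimal sketch of splitting an LSID into its parts. It assumes the usual colon-separated layout (urn, lsid, authority, namespace, object, optional revision); the `LSID` tuple and the `parse_lsid` function are names made up for this illustration and are not part of any LSID toolkit.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Parts of an LSID (hypothetical helper type for this sketch)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID URN into authority, namespace, object and revision."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated parts: {text!r}")
    # The "urn" and "lsid" prefixes are compared case-insensitively.
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {text!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(authority, namespace, object_id, revision)


example = parse_lsid("urn:lsid:ipni.org:names:1234-1")
print(example.authority, example.namespace, example.object_id)
```

The authority part (here ipni.org) is what a resolver would use to locate the service that returns the data or the metadata; that lookup step is not shown.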
Issues For Each Community
What gets an LSID?
Real life objects
Biological specimen
Abstract concepts
Taxon concept or name – Bellis perennis
Electronic representations of things
Image of specimen, description of specimen or concept
For each thing, what’s the data and metadata?
LSIDs
Data doesn’t change but Meta data can
Should all data become meta data?
Maybe it implies a temporal database approach
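One way to picture the "data doesn't change, metadata can" rule is a store that accepts data for an LSID only once but lets metadata be updated freely. This is a sketch under stated assumptions: `LsidStore`, its methods and the example.org identifiers are hypothetical, and a changed data set is assumed to receive a new revision rather than overwrite the old one.

```python
class LsidStore:
    """Toy in-memory store: write-once data, updatable metadata."""

    def __init__(self):
        self._data = {}      # lsid -> bytes, written once
        self._metadata = {}  # lsid -> dict, may change over time

    def put_data(self, lsid, data):
        # Refuse to overwrite existing data with different content.
        if lsid in self._data and self._data[lsid] != data:
            raise ValueError(
                "data behind an LSID must not change; issue a new revision"
            )
        self._data[lsid] = data

    def update_metadata(self, lsid, **fields):
        if lsid not in self._data:
            raise KeyError(lsid)
        self._metadata.setdefault(lsid, {}).update(fields)

    def get_data(self, lsid):
        return self._data[lsid]

    def get_metadata(self, lsid):
        # Return a copy so callers cannot edit the stored metadata.
        return dict(self._metadata.get(lsid, {}))


store = LsidStore()
first = "urn:lsid:example.org:datasets:42:1"   # hypothetical identifier
store.put_data(first, b"site,count\nA,3\n")
store.update_metadata(first, title="Survey")
store.update_metadata(first, title="Survey (corrected title)")

# Changed data goes under a new revision instead of replacing the old one.
second = "urn:lsid:example.org:datasets:42:2"
store.put_data(second, b"site,count\nA,4\n")
```

Keeping every revision is what pushes this towards the temporal database approach raised above: nothing is overwritten, so older results stay reproducible.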
Issues For Each Community
- Who issues LSIDs?
  - Owner of data
    Not always clear who owns data, especially legacy data
  - A central authority
    - One authority responsible for issuing LSIDs for specific types of information
    - This would help enforce a 1:1 mapping of LSIDs and data items
    - It MAY also reduce the likelihood of LSIDs becoming unresolvable
  - A respected authority
    - This would help enforce a 1:1 mapping for those who use the authority
    - It may also be more feasible
  - Free for all (possibly with an index)
    - List your LSID authority in an index so your LSIDs are easy to find
- Perhaps structured delegation has the best potential to globally unite science
Organizations Using LSIDs
- Biopathways consortium
- National Center for Biotechnology Information (NCBI)
  PubMed, GenBank
- European Bioinformatics Institute (EBI)
- BioMOBY – a biological database interoperability program (biomoby.org)
  - represent all entities in MOBY Ontologies (Object, Service, and Namespace), as well as all instances of BioMOBY services.
- myGrid (mygrid.org.uk)
  - used throughout as object naming device
- TDWG (tdwg.org)
  - IPNI – plant names
  - Index Fungorum – fungi names
- US Long Term Ecological Research Network (LTER)
- SEEK (seek.ecoinformatics.org)
  - Used in Kepler – actors, components, TOS – taxon concepts…
Use of LSIDs
[Figure: ecological data sets that record the names Hippocampus tetragonous Mitchill, 1814, Hippocampus marginalis Kaup, 1856, Hippocampus erectus and Lined seahorse each carry the reference 347, which points to the taxon concept Hippocampus erectus Perry 1810, identified by urn:lsid:biocast.org:concept:347]
Moving to a world of LSIDs
- Using LSIDs alone will not address all issues of data sharing
- Data repositories must (re)use LSIDs to cross reference data
  - within and outwith their own repository.
  - It is important that we use the same LSID to refer to the same entity
- If multiple LSIDs exist for the same entity we would be required to decide whether or not two LSIDs were really the same thing.
  - We would be in a worse situation than we are today, for example when trying to decide if two taxonomic names mean the same.
- Generating LSIDs for any self contained data set is a fairly trivial task
- Assigning LSIDs from an authoritative repository to existing data, so that they are re-used, is more challenging
  - Investigate what’s involved…
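As a rough sketch of why re-using one LSID matters, assume (as the seahorse slide suggests) that records using different names all cite the same concept LSID. Cross-referencing then needs no name matching at all. The record layout and the `group_by_concept` function are hypothetical.

```python
from collections import defaultdict

SEAHORSE = "urn:lsid:biocast.org:concept:347"

# Two data sets that use different names for the same taxon concept.
dataset_a = [
    {"name": "Hippocampus tetragonous", "concept": SEAHORSE},
    {"name": "Lined seahorse", "concept": SEAHORSE},
]
dataset_b = [
    {"name": "Hippocampus marginalis", "concept": SEAHORSE},
]


def group_by_concept(*datasets):
    """Collect records from all data sets under their shared concept LSID."""
    grouped = defaultdict(list)
    for rows in datasets:
        for row in rows:
            grouped[row["concept"]].append(row)
    return dict(grouped)


grouped = group_by_concept(dataset_a, dataset_b)
names = sorted(row["name"] for row in grouped[SEAHORSE])
print(names)
```

If each repository had minted its own LSID for the concept, the grouping key would differ per data set and the question of whether two identifiers mean the same thing would return.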
Convert Data Provider to use LSIDs
[Diagram: the original data of the Hexacorallia data provider (the "target" repository, to be updated with LSIDs) is mapped to an ontology and held as RDF data in the Hexacorallia triple store. A Linker Tool matches data from the repository with data in the LSID resolvers, that is, the LSID resolution services of the authority ("source") providers, and returns the matching LSIDs to the repository. Authority providers shown: Specimen, Name, Concept, Publication and Person, each returning LSID + RDF.]
Linking….
[Diagram sequence: a Linker Client drives a Linker that sits between the local ("target") provider, the Hexacorallia thematic triple store, and an authoritative ("source") provider, the Person triple store. Each provider is reached through a WASABI Service Request Dispatcher offering SPARQL, LSID and OAI interfaces. Steps shown:
1. Request linkable classes and select one to be linked
2. Select class to be linked
3. Request possible LSIDs
4. Confirm/Skip annotations: for each person to find an LSID for, a choice of possible persons with LSIDs is offered]
Issues in converting to LSIDs
- Mapping to ontology
  - LSIDs -> RDF -> schema? -> ontology?
    agreement on ontology - problem?
- Replace or annotate existing data?
  - If we replace an author with a person LSID
    - what is returned when resolving that LSID is unlikely to match the data stored in the DB for an author.
- Dependencies between objects with LSIDs
  - If you link via a taxon name LSID, the resolved name should have an embedded LSID for a publication, so there shouldn’t be any need (in principle) to match publications for names
- What about authorities that issue LSIDs but don’t map to other authorities?
  - e.g. name providers not mapping to either publication or specimen providers
Issues in converting to LSIDs
- What support would a linking tool need to provide end users?
  - How would users want to process this data?
  - How much automation?
    E.g. above a certain confidence level
  - Would this be trusted?
  - Order of matching
    E.g. match all instances of persons at once
    Match persons by publication?
- Other Issues…
  - Performance of existing linking tool approach
    Lots of data passing going on
    Need more efficient approach which matches user needs
  - Finding authorities that provide linking services
    How do scientists find out about authorities with linking services?
    How do they know which ones to use?
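The "how much automation" question can be sketched as a threshold on a match score: candidates above the threshold are linked automatically, the rest are left for the user to confirm or skip. This is only an illustration, not how the linker tool described here works. String similarity from Python's standard `difflib` stands in for whatever matching an authority would really offer, and the names, the example.org LSIDs and the 0.9 threshold are all invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_links(local_names, candidates, auto_threshold=0.9):
    """Split local names into auto-accepted links and ones needing review.

    candidates maps an LSID to the name held by the authority.
    """
    accepted, for_review = {}, {}
    for name in local_names:
        scored = sorted(
            ((similarity(name, held), lsid) for lsid, held in candidates.items()),
            reverse=True,
        )
        if not scored:
            continue
        best_score, best_lsid = scored[0]
        if best_score >= auto_threshold:
            accepted[name] = best_lsid
        else:
            # Offer every candidate, best first, for the user to confirm or skip.
            for_review[name] = [lsid for _, lsid in scored]
    return accepted, for_review


candidates = {
    "urn:lsid:example.org:persons:1": "Jane Smith",   # invented entries
    "urn:lsid:example.org:persons:2": "John Smyth",
}
accepted, for_review = propose_links(["Jane Smith", "J. Smith"], candidates)
print(accepted)
print(for_review)
```

Whether a threshold like this would be trusted is exactly the open question above; lowering it saves user effort but risks wrong links that are hard to spot later.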
To Summarise….
- We have seen that (Life) Science is
  - Complex & Changing
  - The fundamental challenges of science that have always been there are still here
  - Now we have additional opportunities associated with the explosion of scientific information and the move to a virtual world
  - And now the challenge is how best to exploit these….
- e-Science uses computation to aid scientists
  - By providing appropriate infrastructure and tool support
    Speed up scientific processes
    Do them repeatedly
    Re-evaluation
  - Can give scientists time for more thoughtful science…
    May require a change of emphasis in how scientists work
  - Must support the inherent features of science, scientists and scientific data
e-Science: Complex Science
Support decomposition of scientific domains, problems
and associated data
Fundamental to data & software analysis and design
Support re-composition, linking or building on the
components
Need to know when components or links have changed
Identify the overlaps/linkages in the different domains
Need useful approximations of things to simplify linked
domain
Need to understand the approximations or linking points well
Raise level of abstraction
Artefact of storage mechanisms
Implies lingua franca
Need more evaluation of the different approaches
e-Science: Changing Science
Science is full of legacy data
Today’s scientific research is tomorrow’s legacy data
Provide long-term persistent storage
Any published scientific discovery should store the data as
evidence
Data needs to be accurately annotated
Sufficient to repeat analyses to test hypotheses
e-Science already changing the way scientists do
science
But to be effective it needs to change even more…
More emphasis on well curated, accessible, persistent data
Evidence for results
Meta Data & Ontologies?
Do we throw out meta data/ontologies, then?
No…
To benefit from stored data we need to know what it means!
However, there are no large-scale benefits while there
is insufficient coverage of meta data
If only 10% of data has meta data, people won’t use meta
data…
Need to reach the tipping point…
Controlled vocabularies and schemas have been shown to be useful for
large projects or small communities with a common goal
Need long-term projects to see if they sustain their value as
the community and the science evolves.
Describe or Prescribe?
Descriptions become vocabularies used by
others
Folksonomy or ontologies?
Informal versus formal or free versus constrained
Informal can be basis for something formal
Move towards common vocabularies
with built in flexibility and extensibility
Issue of what language(s)…
Need more research evaluating these issues…
Reliability of Meta Data
Automatic recording of meta data
From machines, software, workflows…
Avoids labour
Starting to happen
Helps reach critical mass of available meta data
Still need to decide what it is that the
machines/software are collecting…
Human input still needed
Purpose of experiment, deviations from planned protocol
etc.
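A minimal sketch of the split between machine-recorded and human-supplied meta data: the software fills in what it can observe for itself, and the scientist adds only what no machine knows. The field names and the `capture_metadata` function are assumptions made for this example, not part of any existing tool.

```python
import platform
import sys
from datetime import datetime, timezone


def capture_metadata(purpose, deviations=""):
    """Combine automatically recorded fields with human-supplied ones."""
    return {
        # Recorded by the software, no effort from the scientist.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "operating_system": platform.system(),
        "script": sys.argv[0],
        # Still needs a human: why the run was made and what went differently.
        "purpose": purpose,
        "deviations": deviations,
    }


record = capture_metadata(
    purpose="Repeat analysis with corrected input",   # invented example text
    deviations="none",
)
for key, value in record.items():
    print(f"{key}: {value}")
```

Deciding which automatic fields are worth collecting remains a design decision, as the slide notes; the code only shows that collecting them costs the scientist nothing.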
Support
Community ontologies need to be easily available to
all scientists
Listing the known ontologies on a web site is not enough
Need to understand when (meta) data is fit for
purpose
Accurate enough, not overly precise
Need collaborative approaches to extending
ontologies
Allow users to be involved to achieve community buy-in
Ontologies are difficult for people to comprehend
Need good visualisation
Need to trust system
Tools
Simple tools would go a long way to help
Contextual data is consistent for many data sets
e.g. observer/location
Tools should support collection and re-use of this data
Make use of (incorporate) existing ontologies into tools
Get the software to do as much work as possible
Good at repetitive tasks, faster than humans
Personalisation
How application specific do tools have to be to be useful?
Generic/ Domain specific/ Individual?
The more generic the more widely applicable
Pluggable components for personalisation?
Finally…
- It will take time and commitment for any of these approaches to work.
- Focus on central important resources that are reused in many (sub-)domains
  - Ensure the data are well managed and curated, identified, described, easily available, lasting and evolving
  - Observe whether they benefit the community or act as a straitjacket
- A good test case for this approach is the development of a taxon concept name resolution service
  - To allow scientists to find correct names for the concepts they are working with,
  - Mark up their data,
  - Resolve their concepts against other scientists’ data so they know they are talking about the same thing.
  - Is central to communication in all life sciences
  - Poses many computational, social and data research issues
Acknowledgements
E-Science Institute for sponsoring theme leadership
Malcolm Atkinson
For support and many interesting discussions on exploiting
scientific data.
Collaborators
on SEEK project,
Matt Jones, Bill Michener, Aimee Stewart, Robert Gales, Josh Madin,
Shaun Bowers
Collaborators in TDWG/GBIF
Robert Kukla, Roger Hyam,
funding, slides, interesting problems