Choices and challenges in biological information management Heidelberg, Germany

ORIEL
Les Grivell, European Molecular Biology Organisation,
Heidelberg, Germany
EMBO / EMBC activities
• Fellowships + Fellows network
• Courses & workshops
• Young Investigator programme
• Science & Society
• Electronic Information Programme
“Biological research has reached a point
where new generalizations and higher order
biological laws are being approached, but
may be obscured by the simple mass of data”
Harold Morowitz, 1985
Report to the U.S. National Academy of Sciences
One part of the information explosion ….
[Chart: x-axis Year, 1982 to 2000; y-axis from 0.00E+00 to 1.20E+10 (axis label not recoverable). Annotations: Morowitz; various microbial genomes; yeast genome (14 Mbp); C. elegans genome (97 Mbp); Drosophila genome (137 Mbp); human chr. 22 (34.5 Mbp); Arabidopsis (125.4 Mbp); human complete draft (3.1 Gbp)]
Raw sequences are not the only
form of digital information
Genomics-related data
• Vast amounts from high-throughput technology
(the Sanger Centre alone now produces around 60
GB of raw sequence data per day)
• Genomics-based information is heterogeneous and
highly complex, and evolves continuously as data
are updated and/or ideas develop
• There is a need to:
– Discern and understand the relationships between data
generated by different experimental approaches
– Manipulate, analyze and/or integrate this
information
The knowledge cycle (384-well format)
Idea!
Slow; rate-limiting
Databases
(e-) Literature
Hypothesis
Publication
Experiment
Data
Biological information: current reality …..
• Hundreds of different databases, many in flat-file
format
– Non-uniform or lack of external identifiers
– Lack of interoperability at the level of syntax and
semantics
• A vast amount of information accumulating in
images, videos and molecular models
• And knowledge is scattered across the literature in
many thousands of non-computer-readable journal
articles
Deciphering gene symbols
E-BioSci
A new information service for the life sciences
that will interlink factual and image data
repositories with the research literature
EU Quality of Life research infrastructure:
platform under construction
Closely linked to ORIEL, the research arm of E-BioSci, funded within the
European Commission’s IST programme
The current E-BioSci partnership
• Distributed network of information resources
• Europe-based; world-wide role
The E-BioSci platform
• Set of distributed biological resources (literature,
sequence and image databases)
• Full-text search
– across document repositories
– using cross-language queries (e.g. English – French, German, Spanish etc.)
– 2-way navigation links between literature and
molecular datasets via gene symbol recognition
Main features implemented via conceptual fingerprinting
Conceptual fingerprints
Full-text document
Index and link index terms to (multilingual) thesauri
Fingerprint (concept ID, weight):
C19881 0.99
C92992 0.67
C02002 0.66
C99229 0.44
C00392 0.33
C93939 0.21
• 1 CFP = 400 bytes
• Abstraction: 250,000 pages/PC/day
• Matching: 500,000 CFPs in 40 milliseconds
Fingerprint database
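The slide does not say how fingerprints are compared. As a minimal sketch, assuming a fingerprint is a sparse mapping from thesaurus concept IDs to weights and that matching uses cosine similarity (an assumption for illustration, not the documented E-BioSci method), ranking against a fingerprint database might look like this. The concept IDs and weights of the first fingerprint are the ones shown on the slide; `doc_b`, `doc_c`, and all function names are hypothetical.

```python
import math

# A conceptual fingerprint modelled as {concept ID: weight} (assumed representation).
Fingerprint = dict[str, float]


def cosine_similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Cosine similarity between two sparse concept-weight vectors."""
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: Fingerprint, database: dict[str, Fingerprint]) -> list[tuple[str, float]]:
    """Score every stored fingerprint against the query, best match first."""
    scored = [(doc_id, cosine_similarity(query, fp)) for doc_id, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Fingerprint values as shown on the slide.
slide_fingerprint: Fingerprint = {
    "C19881": 0.99,
    "C92992": 0.67,
    "C02002": 0.66,
    "C99229": 0.44,
    "C00392": 0.33,
    "C93939": 0.21,
}

database = {
    "doc_a": slide_fingerprint,
    "doc_b": {"C19881": 0.80, "C02002": 0.50},  # hypothetical, overlaps on two concepts
    "doc_c": {"C55555": 0.90},                  # hypothetical, shares no concept
}

ranking = rank(slide_fingerprint, database)
```

Here `ranking` lists `doc_a` first, then `doc_b`, then `doc_c` with a score of zero. On the slide's figures, 500,000 fingerprints at 400 bytes each amount to 200 MB.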
The prototype search page
First search results …
Refining the search …..
Gene symbol recognition: synonyms
Gene-literature link
Gene symbol recognition: the homonym problem
Resolution of gene ambiguities
Gene symbol recognition
• Prototype currently limited to human genes
• Synonyms recognised well
• Homonyms still a problem
• Extension to other (model) organisms ongoing
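The difference between synonyms and homonyms can be sketched with a simple lookup index. This is an illustration under stated assumptions, not the prototype's actual method: the gene table, the symbols in it, and the function names are all hypothetical, and matching is by case-insensitive whole tokens only.

```python
import re
from collections import defaultdict

# Hypothetical gene table: official symbol -> known synonyms.
# "ALPHA" is listed for two different genes, which makes it a homonym.
GENE_SYNONYMS = {
    "GENEA": ["GA1", "ALPHA"],
    "GENEB": ["GB2", "ALPHA"],
}


def build_index(table):
    """Map every symbol (official or synonym) to the set of genes it may denote."""
    index = defaultdict(set)
    for official, synonyms in table.items():
        for symbol in [official, *synonyms]:
            index[symbol.upper()].add(official)
    return index


def recognise(text, index):
    """Return (token, sorted candidate genes) for each token matching a known symbol."""
    hits = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        genes = index.get(token.upper())
        if genes:
            hits.append((token, sorted(genes)))
    return hits


index = build_index(GENE_SYNONYMS)
hits = recognise("Expression of GA1 and ALPHA was measured.", index)
```

In this sketch `GA1` resolves to a single gene (a synonym, handled by the lookup), while `ALPHA` returns two candidates. A table lookup alone cannot choose between them; some use of the surrounding text would be needed, which is consistent with the slide's remark that homonyms are still a problem.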
Interlinking images with other resources
E-BioSci and semantic interconnection
of searchable resources
[Diagram labels]
• Literature, patents etc.
• Database annotations (sequences, images etc.)
• Open archive repositories
• Community resources collections
• Scientist profiles
• Fingerprint
• Fingerprint database(s)
Many of these aims will require significant research effort
From
to
and back
Iterative prototyping and evaluation
[Diagram labels]
• Iterative feedback, improvement
• BioImage database
• IMGT database
• CNR, ICGEB gene analysis servers
• CNR-EMMA mutant mouse database
• Knowledge representation
• Navigation tools
• Gene mining
• Adaptive interfaces
• Database linkage; full-text searches
• ORIEL prototype staging server
• E-BioSci servers and data network
• Test user group
• Main user group
Acknowledgements
• Frank Gannon, Executive Director EMBO
• … and many others who contributed ideas to the
concepts of E-BioSci and ORIEL
• The E-BioSci and ORIEL partners
• European Commission
(contracts no QLRI-2001-30266 and IST-2001-32688)