Choices and challenges in biological information management
ORIEL
Les Grivell, European Molecular Biology Organisation, Heidelberg, Germany

EMBO / EMBC activities
• Fellowships + Fellows network
• Courses & workshops
• Young Investigator programme
• Science & Society
• Electronic information Programme

“Biological research has reached a point where new generalizations and higher order biological laws are being approached, but may be obscured by the simple mass of data”
Harold Morowitz, 1985 Report to the U.S. National Academy of Sciences

One part of the information explosion ….
[Figure: growth in sequence data by year, 1982 to 2000; vertical axis from 0 to 1.2E+10. Annotations: Morowitz; various microbial genomes; yeast genome (14 Mbp); C. elegans genome (97 Mbp); Drosophila genome (137 Mbp); human chr. 22 (34.5 Mbp); Arabidopsis (125.4 Mbp); human complete draft (3.1 Gbp)]

Raw sequences are not the only form of digital information

Genomics-related data
• Vast amounts from high-throughput technology (the Sanger Centre alone now produces around 60 GB of raw sequence data per day)
• Genomics-based information is heterogeneous, highly complex and evolves continuously as data are updated and/or ideas develop
• There is a need to:
  – Discern and understand the relationships between data generated by different experimental approaches
  – Manipulate, analyze and/or integrate this information

The knowledge cycle (384-well format)
[Diagram labels: Idea!; Hypothesis; Experiment; Data; Publication; Databases; (e-) Literature. One step is marked “Slow; rate-limiting”]

Biological information: current reality …..
• Hundreds of different databases, many in flat-file format
  – Non-uniform or missing external identifiers
  – Lack of interoperability at the level of syntax and semantics
• A vast amount of information accumulating in images, videos and molecular models
• And knowledge is scattered across the literature in many thousands of non-computer-readable journal articles

Deciphering gene symbols

E-BioSci
A new information service for the life sciences that will interlink factual and image data repositories with the research literature
• EU Quality of Life research infrastructure: platform under construction
• Closely linked to [project logo], the research arm of E-BioSci, funded within the European Commission’s IST programme

The current E-BioSci partnership
• Distributed network of information resources
• Europe-based; world-wide role

The E-BioSci platform
• Set of distributed biological resources (literature, sequence and image databases)
• Full-text search
  – across document repositories
  – using cross-language queries (e.g. English to French, German, Spanish etc.)
  – 2-way navigation links between literature and molecular datasets via gene symbol recognition
Main features implemented via conceptual fingerprinting

Conceptual fingerprints
Index and link index terms to (multi-lingual) thesauri
[Diagram labels: Full text document; Fingerprint database]
Example fingerprint entries:
  C19881  0.99
  C92992  0.67
  C02002  0.66
  C99229  0.44
  C00392  0.33
  C93939  0.21
• 1 CFP = 400 bytes
• Abstraction: 250,000 pages/PC/day
• Matching: 500,000 CFPs in 40 ms

The prototype search page

First search results …

Refining the search …..
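
The conceptual fingerprint slide shows a document reduced to a short list of weighted concept codes (for example C19881 0.99) and quotes a matching time of 40 ms for 500,000 fingerprints. The slides do not say how E-BioSci compares fingerprints, so the following is only a minimal sketch under stated assumptions: a fingerprint is modelled as a mapping from concept code to weight, and documents are ranked by cosine similarity. The similarity measure, the function names and the three stored fingerprints (doc_a, doc_b, doc_c, including the code C55555) are hypothetical; only the query fingerprint reuses the values from the slide.

```python
import math

# Assumption: a conceptual fingerprint is a mapping {concept_code: weight}.
# The slides do not specify the real data structure or matching method.


def cosine(a, b):
    """Cosine similarity between two fingerprints (assumed measure)."""
    dot = sum(weight * b[code] for code, weight in a.items() if code in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, database):
    """Return (doc_id, score) pairs, best match first."""
    scored = [(doc_id, cosine(query, fp)) for doc_id, fp in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Query fingerprint: the example entries shown on the slide.
QUERY = {
    "C19881": 0.99,
    "C92992": 0.67,
    "C02002": 0.66,
    "C99229": 0.44,
    "C00392": 0.33,
    "C93939": 0.21,
}

# Hypothetical stored fingerprints, invented for this illustration.
DATABASE = {
    "doc_a": {"C19881": 0.90, "C92992": 0.70},  # shares the top concepts
    "doc_b": {"C00392": 0.80, "C93939": 0.50},  # shares only minor concepts
    "doc_c": {"C55555": 1.00},                  # no concepts in common
}

if __name__ == "__main__":
    for doc_id, score in rank(QUERY, DATABASE):
        print(f"{doc_id}: {score:.3f}")
```

Taking the slide's figures at face value, 500,000 fingerprints of 400 bytes each occupy about 200 MB, and 40 ms for 500,000 comparisons works out to roughly 80 ns per fingerprint. A plain Python loop like the one above would not reach that speed; it is meant only to make the idea of comparing weighted concept lists concrete.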
Gene symbol recognition: synonyms

Gene–literature link

Gene symbol recognition: the homonym problem

Resolution of gene ambiguities

Gene symbol recognition
• Prototype currently limited to human genes
• Synonyms recognised well
• Homonyms still a problem
• Extension to other (model) organisms ongoing

Interlinking images with other resources

E-BioSci and semantic interconnection of searchable resources
[Diagram labels: Literature, Patents etc; Database annotations (sequences, images etc); Open archive repositories; Community resources collections; Scientist profiles; Fingerprint; Fingerprint Database(s)]
Many of these aims will require significant research effort

From [logo] to [logo] and back
Iterative prototyping and evaluation
[Diagram labels: Iterative feedback, improvement; BioImage database; IMGT database; CNR, ICGEB gene analysis servers; CNR-EMMA mutant mouse database; Knowledge representation; Navigation tools; Gene mining; Adaptive interfaces; Database linkage; full text searches; ORIEL prototype staging server; E-BioSci servers and data network; Test user group; Main user group]

Acknowledgements
• Frank Gannon, Executive Director EMBO
• … and many others who contributed ideas to the concepts of E-BioSci and ORIEL
• The E-BioSci and ORIEL partners
• European Commission (contracts no QLRI-2001-30266 and IST-2001-32688)