Enabling European-Wide Sharing of Data in the Life Sciences
Biological research is being transformed from a laborious and costly data-gathering discipline to a
highly collaborative science driven by systematic and (relatively) inexpensive data acquisition followed
by complex analysis. As in other data-intensive sciences such as high energy physics, astronomy and
oceanography, large datasets drive discoveries and form a bedrock of information on which life scientists plan, execute and understand future investigations.
Much of biology relies on good, accurate cataloguing of facts about biological systems – from the
DNA sequence of genomes through the three-dimensional structures of proteins to the arrangement of
molecules in molecular pathways. Since the mid-1970s systematic databases of known molecules have
been developed, first for protein structure (PDB) and then DNA sequence in the 1980s (EMBL, GenBank).
These databases, and others like them, have provided the bedrock for many discoveries, both planned
and serendipitous, over the decades.
The remarkable diversity addressed in life science encompasses 7 billion people worldwide, over 8
million eukaryotic species and at least 10 million bacterial species. Among eukaryotes, individuals
themselves can be a complex assemblage of cells, tissues and commensal organisms. This complexity
and diversity are altered continuously through the process of evolution, making data management a
daunting undertaking. Living organisms respond to and interact with their environment, often through
mechanisms that are only partly understood. This observational and experimental complexity makes
metadata and provenance acquisition complex but critical. It also means that life science arguably
provides the most complex and heterogeneous datasets that science can currently imagine.
Life science needs a new approach.
The onset of high-throughput sequencing technologies has created a deluge of data. Most life-science data archives double every 9-12 months, with some disciplines growing even faster; for example,
proteomics databases currently double in size every 4-5 months. With high-content biology and, in
particular, sequence-based biological assays becoming routine at every major bio-research centre, and
accessible by most of Europe’s life-science researchers, we need to connect data management,
standards, and services between all stakeholders - from local research institutes through to global core
reference data archives. Data-driven analysis and research rely on a large and growing number of
reference data resources and biological knowledge-bases: some serve all life-science disciplines, while
others are small, focused resources that are critically important for a single community. In Europe alone
there are over 1,800 bioinformatics resources (http://www.elixir-europe.org/documents/final-reportstrategy-data-resources).
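To give a sense of scale for the doubling times quoted above, the following minimal Python sketch converts a doubling time in months into an annual growth factor. It assumes idealised exponential growth with a constant doubling time, which is a simplification, and the function name is purely illustrative.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Factor by which an archive grows over 12 months.

    Assumption: smooth exponential growth with a constant doubling time.
    """
    if doubling_months <= 0:
        raise ValueError("doubling time must be positive")
    return 2 ** (12 / doubling_months)


# Doubling times taken from the text: 9-12 months for most archives,
# 4-5 months for proteomics databases.
for label, months in [
    ("most archives (slow end)", 12),
    ("most archives (fast end)", 9),
    ("proteomics (slow end)", 5),
    ("proteomics (fast end)", 4),
]:
    factor = annual_growth_factor(months)
    print(f"{label}: doubling every {months} months -> x{factor:.1f} per year")
```

Under this assumption, an archive that doubles every 12 months grows twofold in a year, whereas one that doubles every 4 months grows eightfold over the same period.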
Data needs to be Findable, Accessible, Interoperable and Re-usable (FAIR) to generate value for a
research community beyond the initial researcher’s laboratory. The importance of long-term
stewardship is highlighted by the observation that the odds of retrieving the data from a scientific
publication decline by 17% per year. Life science data infrastructure must be able to cope with the
aggregation, annotation and functional integration of data from thousands of laboratories across
Europe, as well as the access demands of users worldwide (e.g. the Human Protein Atlas received more
than 750,000 visits during 2013).
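The 17% annual decline quoted above can be made concrete with a short Python sketch. It assumes the decline compounds multiplicatively, so that each year the odds are 83% of the previous year's odds; this is one reading of the figure, not something the text spells out. The result is expressed relative to the odds at publication, because the starting odds are not given here.

```python
def relative_odds(years: float, annual_decline: float = 0.17) -> float:
    """Odds of retrieving the data after `years`, relative to the odds
    at publication.

    Assumption: the decline compounds, i.e. the odds are multiplied by
    (1 - annual_decline) every year.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return (1 - annual_decline) ** years


for years in (1, 5, 10, 20):
    print(f"after {years:>2} years: {relative_odds(years):.3f} of the initial odds")
```

Under this assumption, the odds fall to roughly 15% of their initial value after ten years, which illustrates why long-term stewardship matters.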
ELIXIR, established in 2014 as a legal entity, brings together Europe’s major life-science data
archives and, for the first time, connects these with national bioinformatics infrastructures. By
coordinating local, national and international resources the ELIXIR infrastructure will meet the data-related needs of Europe’s 500,000 life scientists. This scalable infrastructure connects and sustains life science’s core data archives and provides standards, tools and training for data stewardship.
ELIXIR is an Open Infrastructure: it does not “own” all data resources in Europe. ELIXIR provides a
coordinated ELIXIR Interoperability Backbone that allows partners (e.g. other Research Infrastructures,
national resources, institutional archives) to make use of existing resources and connect and
interoperate their own resources. Providing a sustainable infrastructure that manages data identifiers,
secures data archiving and access, and ensures mappings between resources will enable long-term, cost-effective data management and drive “standards as the default” across the life sciences.
This talk will make the case, through examples of value and reuse, that ‘Open data’ needs to go
beyond disclosure: to have an impact on future research projects, data needs to be managed. Findable,
Accessible, Interoperable and Re-usable research data requires infrastructure and well-trained experts
who support users.