Imagine what would happen if researchers and clinicians used different, non-standard definitions and symbols for units of measure and for the chemical elements. Yet every day, translational science is inhibited by exactly this problem: research data sets use different, non-standard definitions and symbols for diseases, treatments, tissues, specimens, cells, proteins, DNA sequences, and so on.
Two Barriers to Translation
1. Lack of standard definitions and symbols impedes aggregation and cross-dataset analysis. Increasingly, aggregation of data across research groups, settings (the clinic, the hospital, the research lab, the classroom, the policymaker’s office, and so on), institutions, disciplines, and continents is necessary to the progress of translational science. The lack of standard definitions of and symbols for diseases, treatments, outcomes, risk factors, environments, body parts and regions, cells, proteins, genes, DNA sequences, drugs, and so on significantly inhibits this aggregation and thus impedes the progress of translational science.
2. Computational tools’ lack of domain knowledge impedes discovery from huge datasets. The size of the datasets, even before aggregation, is growing exponentially. Sifting through these data for the purposes of scientific discovery is now fully reliant on computational tools. However, greater progress would be possible if we could impart more knowledge to the computational tools in a way that leverages consistent aggregation.
These two problems are interrelated: algorithms that understand one definition of and symbol for each disease will not work on data collected with different definitions and symbols. Without standard definitions and symbols, we must either manually convert data or perform duplicative work on algorithms to enable the same kinds of discovery from two disparate data sets (let alone from very large data sets created by combining data from multiple origins).
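The cost of divergent coding can be made concrete with a small sketch. The records, local symbols, and the ontology-style identifier below are all hypothetical, invented purely for illustration:

```python
# A minimal sketch of the cross-dataset coding problem. All codes and
# labels here are hypothetical, for illustration only.

# Two research groups record the same disease under different local symbols.
dataset_a = [{"patient": "A1", "diagnosis": "MI"}]
dataset_b = [{"patient": "B7", "diagnosis": "heart attack"}]

# Without a shared standard, an algorithm keyed to one symbol misses the
# other. A mapping to a single ontology identifier restores comparability.
TO_STANDARD = {
    "MI": "DISEASE:0001",           # hypothetical ontology identifier
    "heart attack": "DISEASE:0001",
}

def standardize(records):
    """Replace each local diagnosis symbol with its standard ontology ID."""
    return [{**r, "diagnosis": TO_STANDARD[r["diagnosis"]]} for r in records]

# Both records now share one code, so the two data sets aggregate cleanly.
combined = standardize(dataset_a) + standardize(dataset_b)
assert {r["diagnosis"] for r in combined} == {"DISEASE:0001"}
```

Note that the mapping table itself is exactly the duplicative, per-pair conversion work described above; a standard ontology lets every group record the shared identifier in the first place.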
The current situation is equivalent to one in which each researcher is printing her own currency: each set of definitions and symbols for disease, treatments, etc. is a different currency, and converting between two currencies is difficult and labor intensive, inhibiting “commerce” – and therefore innovation – in research.
Approach to Surmounting the Barriers: Using Standard Ontologies
One major and necessary component of the solution to these problems, even for purposes other than translational science, is the development and use of standard ontologies. An ontology is a standard set of definitions and symbols to refer to things in the world, to enable fast integration and aggregation of data created by different communities.
Ontologies also capture the classifications and definitions that scientists, educators, healthcare practitioners, policymakers, and so on, use in their daily work. The goal is to promote the use by these individuals of standard definitions and symbols (or codes)—analogous to standard units of measure and standard symbols for the elements—when recording their data, so that fast data integration and aggregation becomes possible. As researchers make new discoveries and subsequently revise their classifications of the types of things in the world, the job of ontologists is to revise the ontologies to capture these changes, analogous to how additions to the Periodic Table are made.
Besides capturing the types and taxonomy of a particular field of study and/or practice, an ontology also captures nontaxonomic relations such as parthood, location, and so on. When these relations are structured as axioms in a formal logic, the ontology also supports computer algorithms in the advanced analysis of the data, allowing them to draw new conclusions that were not possible before the axioms existed.
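One such axiom is that parthood is transitive. The following sketch, with hypothetical anatomy relations (real ontologies encode such axioms in formal languages such as OWL and delegate this work to dedicated reasoners), shows how a single axiom lets software derive facts no one asserted directly:

```python
# A minimal sketch of axiom-driven inference. The relations below are
# hypothetical; real ontologies use formal languages (e.g., OWL) and
# dedicated reasoners rather than hand-written loops.

# Asserted part_of facts, as (part, whole) pairs.
PART_OF = {
    ("left_ventricle", "heart"),
    ("heart", "cardiovascular_system"),
}

def transitive_closure(pairs):
    """Apply the axiom 'part_of is transitive' until no new facts appear."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

inferred = transitive_closure(PART_OF)
# The axiom yields a conclusion that was never asserted:
assert ("left_ventricle", "cardiovascular_system") in inferred
```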
Furthermore, adoption of standard ontologies allows the new algorithms to analyze any data set collected in accordance with the ontologies as well as any combination of such data sets, enabling analyses with a larger number of cases and correlations over a larger set of variables.
Benefits and Costs of the Approach
The net effect is that we vastly improve the efficiency with which researchers can make discoveries from data, and
maximize the potential of the data collected. Note that data are the ultimate outcome of research: investigators retire, reagents are used up in chemical reactions, laboratory equipment becomes obsolete, laboratories and buildings are destroyed to make way for new construction, and so on. Only the data remain. Therefore, given the great cost to society to create data as the ultimate outcome of research, we ought to maximize the value and reuse of the data.
Standardization of data through the development and use of standard definitions and symbols captured in ontologies holds great potential in this regard. Well-structured standard ontologies can serve as a catalyst and have a transformative effect on translational science.
The evidence that ontologies have this transformative effect exists: witness the advances made possible through the widespread adoption and use of the Gene Ontology (GO). In addition to enabling consistent aggregation of numerous, large datasets, it has led to the development of entirely new data analysis techniques, such as GO overrepresentation analysis.
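The core of an overrepresentation analysis of this kind is a hypergeometric test: given a study set of genes, is a given annotation term observed more often than chance would predict against the annotated background? A minimal sketch, with invented counts for illustration:

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric: N genes in the background,
    K of them annotated with the term, n drawn into the study set."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# Invented example: 12 of 20 study-set genes carry the term, versus
# 400 of 10,000 genes in the background (expected by chance: 0.8).
p = hypergeom_pvalue(k=12, n=20, K=400, N=10_000)

# Such a small p-value flags the term as overrepresented in the study set.
assert p < 1e-6
```

In practice this test is run over many terms at once with multiple-testing correction; the sketch shows only the statistical kernel that consistent GO annotation made possible.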
The development and maintenance of well-structured standardized ontologies and promotion of their use requires sustained effort by core working groups that marshal the collaborative effort of the communities whose work would benefit from using these ontologies for data coding and analysis. Funding is needed for these core ontology groups.
The Request