Machine Learning for Precision Medicine
Heli Julkunen
Aalto University publication series
Doctoral Theses 8/2025
A doctoral thesis completed for the degree of Doctor of Science to
be defended, with the permission of the Aalto University School of
Science at a public examination held at the lecture hall T2 of the
school on 31 January 2025 at 12 noon.
Aalto University
School of Science
Department of Computer Science
Supervising professor
Prof. Juho Rousu, Aalto University, Finland
Preliminary examiners
Prof. Ron Do, Icahn School of Medicine at Mount Sinai, United States
Dr. Taru Tukiainen, Institute for Molecular Medicine Finland (FIMM), Finland
Opponent
Prof. Maik Pietzner, Precision Healthcare University Research Institute,
Queen Mary University of London, United Kingdom and Berlin Institute of
Health at Charité – Universitätsmedizin Berlin, Germany
© Heli Julkunen
ISBN 978-952-64-2351-7 (paperback)
ISBN 978-952-64-2352-4 (pdf)
ISSN 1799-4934 (paperback)
ISSN 1799-4942 (pdf)
http://urn.fi/URN:ISBN:978-952-64-2352-4
Unigrafia Oy
Helsinki 2025
Author: Heli Julkunen
Name of the doctoral thesis: Machine Learning for Precision Medicine
Number of pages: 200
Keywords: precision medicine, machine learning, predictive modelling, survival
analysis, risk prediction, metabolomics, drug combinations
Finnish title: Koneoppimisratkaisuja täsmälääketieteeseen (Machine learning solutions for precision medicine)
Acknowledgements
This doctoral thesis marks the culmination of an incredible journey of
research, learning, and growth. It has been shaped by the support and
expertise of many remarkable individuals and institutions, to whom I am
deeply grateful.
First and foremost, I would like to express my heartfelt thanks to my
supervising professor, Juho Rousu, for believing in me and my abilities
from the very beginning. Your guidance, encouragement, and leadership
have been invaluable during this academic journey. From my first steps as
a research intern in your group to this point, your mentorship has played
a pivotal role in my development as a scientist, and your expertise and
leadership have left an undeniable mark on this work.
I extend my deepest gratitude to all my co-authors. My special thanks
go to Anna Cichońska, the second author on most of the publications in
this work. You introduced me to the world of scientific research during my
early days as an undergraduate student at Aalto University and continued
to guide me through my growth as a researcher and later as a colleague
at Nightingale Health. Beyond your exceptional scientific expertise and
guidance, you have become one of my dearest friends. Your brilliance,
kindness, and dedication are truly inspiring, and I deeply cherish the
many experiences we have shared, both professionally and personally.
I also wish to thank all co-authors involved in the drug combination
prediction work, including Sandor Szedmak, Prson Gautam, Jane Douat,
Tapio Pahikkala, and Tero Aittokallio. Sandor, your boundless ideas and
deep mathematical knowledge continue to amaze me, and I have greatly
enjoyed our many discussions. Prson, thank you for taking the time to
validate my predictions in the lab. Jane, thank you for your enthusiasm
and commitment to your work. Tapio and Tero, I am deeply grateful for
your fantastic ideas, expertise and guidance, which were integral to the
success of this work. I am truly grateful for the unique contributions each
of you brought to this work.
I would also like to thank all co-authors from the metabolomics research
conducted at Nightingale Health, especially Kirsten Schut, Sini Kerminen,
Valtteri Mäkelä, Kristian Nybo, Jussi Nokso-Koivisto, Sara Lundgren,
Nurlan Kerimov, Luke Jostins-Dean, Mika Tiainen, Harri Koskela, Eline
Slagboom, Antti Kangas, Pasi Soininen, Peter Würtz, and Jeffrey Barrett.
It has been a pleasure to work with all of you. Sini and Kirsten, you have
both been wonderful colleagues and friends, and I admire your enthusiasm
and dedication to your work. I also wish to extend my thanks to all other
colleagues at Nightingale Health, who made many aspects of this work
possible. A special thanks to Tuija, Salla, Valtteri, Kristian, Jussi, Sara,
Nurlan, Joni, Vilma, Emmi, Ella, Juuso, and many others for making my
time there so memorable. I also wish to thank the founders of Nightingale
Health (Antti, Pasi, Peter and Teemu) for creating an exceptional environment
for innovation that enabled me to contribute to impactful and
world-leading research projects.
I thank my pre-examiners, Ron Do and Taru Tukiainen, for their thorough evaluation of this work and for their insightful and encouraging
feedback. I am also deeply grateful to Maik Pietzner for accepting the role
of opponent and for dedicating the time to engage with my research and
this process; it is an honour to have you as the opponent.
I thank those who facilitated the collection and processing of the datasets
used in this research. Special thanks to the UK Biobank Laboratory
and Data Access Teams for the seamless collaboration in creating the
metabolomics datasets now accessible to researchers worldwide. I am also
grateful to the UK Biobank study participants, whose contributions were
essential to this work. I would also like to acknowledge Aalto University’s
CS-IT team for providing the computational resources that supported
many of the machine learning aspects in this work.
Along the way, I have had the privilege of working with many wonderful
people. I am deeply grateful to all my past and present colleagues and
collaborators for the enriching discussions and memorable moments we
have shared. To my colleagues at the Aalto Computer Science department
(Maryam, Riikka, Tian, Anchen, Gianmarco, Robert, Emily, Taneli, Vikas,
and Elena), thank you for the engaging discussions, group lunches, and
shared activities. A special thanks to Maryam and Riikka for your kindness
and shared office moments that have brightened my days.
Finally, to all my friends (those from earlier days, those met during my
university years, and those encountered at work), thank you for bringing
joy and balance to my life. To all my family, thank you for your belief in me
and for standing by me through every step. A special thanks to my sister
Henna for your constant encouragement, and to Peter for your unwavering
support and love. I am deeply grateful to each of you for being part of this
journey.
Helsinki, December 22, 2024,
Heli Julkunen
Contents
Acknowledgements
Contents
List of Publications
Author’s Contribution
1. Introduction
1.1 Motivation
1.1.1 Risk prediction in precision medicine
1.1.2 Treatment strategies in precision medicine
1.2 Research aims
1.3 Summary of research contributions
1.4 Outline
2. Background
2.1 Molecular profiling in precision medicine
2.2 Disease risk prediction in precision medicine
2.2.1 Fundamentals of time-to-event modelling
2.2.2 Risk prediction in preventive healthcare
2.2.3 Emerging trends in risk prediction
2.3 Metabolomics for predicting disease risks
2.3.1 Human metabolome
2.3.2 High-throughput profiling of metabolites
2.3.3 Prior research in metabolomics and disease risk
2.4 Treatment strategies in precision medicine
2.4.1 Drugs and drug targets
2.4.2 Drug combination treatments
2.4.3 Quantifying drug combination effects
2.4.4 Prior research in predicting drug combination effects
2.5 Machine learning and statistical inference
2.5.1 Multiple regression
    Logistic regression
    Cox proportional hazards regression
    Statistical inference and interpretation
    Regularization
    Interactions
2.5.2 Factorization machines
    Standard formulation of factorization machines
    Higher-order factorization machines
3. Predictive modelling of drug combination effects (Publication I)
3.1 Foundations of comboFM
3.2 Drug combination dataset
3.3 Evaluation settings
3.4 Accurate predictions of drug combination effects
3.5 Experimental validation of predicted drug combinations
3.6 Conclusions
4. Predictive modelling of disease risks using metabolomic biomarkers (Publications II-IV)
4.1 Datasets
4.1.1 UK Biobank
4.1.2 THL Biobank
4.1.3 Estonian Biobank
4.1.4 NMR metabolomic biomarker profiling
4.2 Predictive modelling of severe infectious disease and COVID-19 risk using metabolomic biomarkers (Publication II)
4.2.1 Study setting and methodology
4.2.2 Associations of individual metabolomic biomarkers with severe pneumonia and COVID-19
4.2.3 Multi-biomarker score stratifies the risk of severe infectious diseases
4.2.4 Conclusions
4.3 Systematic characterization of the associations of metabolomic biomarkers across common diseases (Publication III)
4.3.1 Study setting and methodology
4.3.2 Biomarker associations across a broad range of diseases
4.3.3 Insights into shared biomarker signatures
4.3.4 Accounting for the effects of lipid lowering medications
4.3.5 Conclusions
4.4 Metabolomic and genomic prediction of common diseases (Publication IV)
4.4.1 Study setting and methodology
4.4.2 Metabolomic risk scores stratify disease risk
4.4.3 Conclusions
5. Machine learning with comprehensive interaction modelling for disease risk prediction (Publication V)
5.1 Foundations of survivalFM
5.2 Study population and evaluation settings
5.3 Improved prediction of disease risk across various settings
5.4 Enhanced cardiovascular risk prediction performance
5.5 Conclusions
6. Concluding remarks
References
Publications
List of Publications
This thesis consists of an overview and of the following publications which
are referred to in the text by their Roman numerals.
I Heli Julkunen, Anna Cichońska, Prson Gautam, Sandor Szedmak,
Jane Douat, Tapio Pahikkala, Tero Aittokallio, Juho Rousu. Leveraging multi-way interactions for systematic prediction of pre-clinical drug
combination effects. Nature Communications, December 2020.
II Heli Julkunen, Anna Cichońska, P. Eline Slagboom, Peter Würtz,
Nightingale Health UK Biobank Initiative. Metabolic biomarker profiling
for identification of susceptibility to severe pneumonia and COVID-19 in
the general population. eLife, May 2021.
III Heli Julkunen, Anna Cichońska, Mika Tiainen, Harri Koskela, Kristian Nybo, Valtteri Mäkelä, Jussi Nokso-Koivisto, Kati Kristiansson,
Markus Perola, Veikko Salomaa, Pekka Jousilahti, Annamari Lundqvist,
Antti J. Kangas, Pasi Soininen, Jeffrey C. Barrett, Peter Würtz. Atlas of
plasma NMR biomarkers for health and disease in 118,461 individuals
from the UK Biobank. Nature Communications, February 2023.
IV Nightingale Health Biobank Collaborative Group. Metabolomic and
genomic prediction of common diseases in 700,217 participants in three
national biobanks. Nature Communications, November 2024.
List of authors in alphabetical order: Jeffrey C. Barrett, Tõnu Esko,
Krista Fischer, Luke Jostins-Dean, Pekka Jousilahti, Heli Julkunen,
Tuija Jääskeläinen, Antti Kangas, Nurlan Kerimov, Sini Kerminen, Anastassia Kolde, Harri Koskela, Jaanika Kronberg, Sara N. Lundgren, Annamari Lundqvist, Valtteri Mäkelä, Kristian Nybo, Markus Perola, Veikko
Salomaa, Kirsten Schut, Maiju Soikkeli, Pasi Soininen, Mika Tiainen,
Taavi Tillmann, Peter Würtz.
V Heli Julkunen, Juho Rousu. Machine learning for comprehensive
interaction modelling improves disease risk prediction in the UK Biobank.
Submitted, July 2024. Available on medRxiv.
Author’s Contribution
Publication I: “Leveraging multi-way interactions for systematic
prediction of pre-clinical drug combination effects”
The research project was initiated and conceptualized in collaboration with
me, Anna Cichońska, Sandor Szedmak, Tapio Pahikkala, Tero Aittokallio,
and Juho Rousu. I had a primary role in designing and implementing the
comboFM machine learning framework and performing the computational
analyses, with input from Anna Cichońska. The design of the computational analyses and evaluation protocols was shaped through collaborative
input from Anna Cichońska, Sandor Szedmak, Tapio Pahikkala, and Juho
Rousu. The random forest comparison experiment was performed by Jane
Douat under my supervision. I prepared all the figures. Based on the
computational predictions, experimental wet-lab validation of drug combinations was designed and performed by Prson Gautam. The results
were analyzed jointly by the authors. The initial draft of the article was
written by me, and then revised and edited by Anna Cichońska, Tero Aittokallio and Juho Rousu, with contributions from Sandor Szedmak and
Tapio Pahikkala.
Publication II: “Metabolic biomarker profiling for identification of
susceptibility to severe pneumonia and COVID-19 in the general
population”
The research was conceptualized and designed jointly by all authors. The
data from UK Biobank was curated and processed jointly by me and Anna
Cichońska. The computational and statistical analyses were implemented
and performed mainly by me. The results were jointly interpreted by all
authors. I prepared all figures except Figure 9, which was prepared by
Anna Cichońska. All authors contributed to writing the article.
Publication III: “Atlas of plasma NMR biomarkers for health and
disease in 118,461 individuals from the UK Biobank”
The research was designed and conceptualized in collaboration with me,
Anna Cichońska, Antti Kangas, Pasi Soininen, Jeffrey Barrett and Peter
Würtz. The data from UK Biobank was curated and processed jointly by
me and Anna Cichońska. The computational and statistical analyses were
implemented and performed mainly by me. The online tool to visualize
the results and query the summary statistics was implemented by me.
I prepared all the figures. Mika Tiainen, Harri Koskela, Kristian Nybo,
Valtteri Mäkelä, Jussi Nokso-Koivisto, Pasi Soininen and Antti Kangas
performed the NMR metabolomic biomarker measurements, quantification
and quality control. Kati Kristiansson, Markus Perola, Veikko Salomaa,
Pekka Jousilahti and Annamari Lundqvist contributed to data collection at
the THL Biobank, which was used for replication analyses. The results
were jointly interpreted and the article written by me, Anna Cichońska,
Jeffrey Barrett and Peter Würtz.
Publication IV: “Metabolomic and genomic prediction of common
diseases in 700,217 participants in three national biobanks”
This project was carried out in collaboration with three major biobanks,
comprising UK Biobank, Estonian Biobank and THL Biobank, and included investigators from five institutions. The research was conceptualized primarily by me, Peter Würtz and Jeffrey Barrett. The data was
curated and statistical analyses performed by me, Nurlan Kerimov, Sini
Kerminen, Sara Lundgren, Kirsten Schut and Luke Jostins-Dean. My
primary contribution was in designing methodology for the study, implementing the computational frameworks and pipelines for model training
and evaluation, and conducting cross-biobank analyses of the metabolomic
biomarker scores. This included analyses of the model performance and
calibration, along with the related visualizations. Analyses and visualizations related to polygenic risk scores, clinical risk factors and multiple time
points were performed by Sini Kerminen, Sara Lundgren, Kirsten Schut
and Luke Jostins-Dean. After the initial submission of the article, I transitioned to Aalto University, and the subsequent revisions have been carried
out by the other co-authors. Harri Koskela, Valtteri Mäkelä, Kristian Nybo,
Maiju Soikkeli and Pasi Soininen performed NMR metabolomic biomarker
measurements, quantification and quality control. The manuscript was
written by Jeffrey Barrett, with contributions from me, Luke Jostins-Dean,
Peter Würtz, Sini Kerminen, Sara Lundgren, Nurlan Kerimov and Kirsten
Schut. Detailed contributions from all authors are given in the original
publication.
Publication V: “Machine learning for comprehensive interaction
modelling improves disease risk prediction in the UK Biobank”
I conceived the idea of developing a machine learning method for survival
analysis to account for the comprehensive interaction effects among predictor variables using concepts derived from factorization machines applied
in Publication I. The methodology was mainly designed by me, with input
from Juho Rousu. The R package implementation of the method was written by me. The data from UK Biobank was curated and processed by me.
The computational analyses were designed, implemented and performed
by me. I analyzed the results and prepared all the figures. The manuscript
was written mainly by me.
1. Introduction
1.1 Motivation
Precision medicine is increasingly regarded as the future of healthcare,
where prevention and treatment strategies are implemented by accounting for the unique characteristics of individual patients or subgroups of
patients. This concept has also gained attention from policymakers; for
instance, in 2015, former U.S. President Barack Obama launched the
Precision Medicine Initiative, aimed at "delivering the right treatment
at the right time, every time, to the right person" 1,2 . Similar initiatives
have emerged globally 3–6 , reflecting the growing recognition that precision
medicine has the potential to reduce healthcare costs, improve patient
outcomes, and enable more effective, targeted interventions 7–9 .
The concept of precision medicine is not new; for instance, blood types
have been used to personalize blood transfusions for over a century 10 .
However, the prospect of widely applying this concept has been notably
improved by the advances in molecular profiling ’omics technologies, such
as genomics, transcriptomics, proteomics and metabolomics, which have
increased the amount of molecular data that can be collected for each
individual patient. The improved scalability and reduced costs of these
platforms have facilitated their widespread adoption in research settings
and are beginning to contribute to their use in clinical practice 11,12 . These
advancements, coupled with the development of computational methods
for analyzing the vast amounts of generated data, have enhanced our
understanding of the molecular alterations underlying disease development. Consequently, this has created opportunities for discovering effective
treatments, identifying disease biomarkers, and developing risk prediction
models 8,13–15 .
The challenge lies in translating the continuously increasing volumes
of complex data into actionable insights for precision medicine. Computational methods, such as machine learning, are crucial in this endeavor, as
they enable the integration, analysis, and interpretation of vast, heterogeneous datasets. Machine learning builds upon statistical learning theory
to learn patterns from observed data to predict outcomes in previously
unseen instances. In the context of precision medicine, machine learning
methods can be applied, for instance, to predict disease risks and responses
to treatments, based on the unique molecular and clinical characteristics
of the patients 16–19 .
This dissertation develops machine learning frameworks and performs
statistical analyses to contribute to various aspects of precision medicine,
leveraging recently emerged extensive biomedical data collections. The
research focuses on two primary themes: (1) improving disease risk prediction, particularly through the use of metabolomic biomarkers and the
development of machine learning methods for risk modelling, and (2) advancing the discovery of effective treatments by developing a machine
learning framework to predict the effects of drug combination therapies.
Hence, this dissertation contributes to advancing both prevention and
treatment aspects of precision medicine.
1.1.1 Risk prediction in precision medicine
Identifying undiagnosed individuals at an elevated risk of developing a
disease is essential for precision medicine to enable targeted interventions
that prevent or delay disease onset. For instance, in current clinical practice, cardiovascular risk prediction models are widely used to guide the
allocation of lipid-lowering treatments based on factors like cholesterol levels, blood pressure and age 20–22 . However, risk prediction can be improved
and extended to a wider range of diseases using the ’omics technologies.
This is exemplified by polygenic risk scores, which aggregate data from
numerous genetic variants to estimate an individual’s susceptibility to
disease. The success of polygenic risk scores has been largely driven by
the availability of genomic data at scale 23,24 . As other types of omics data
become similarly accessible, they also present potential for risk prediction
and may provide even greater prediction accuracy.
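In its simplest additive form, a polygenic risk score is a weighted sum of an individual's risk allele counts, with weights typically taken from effect sizes estimated in genome-wide association studies. The short Python sketch below illustrates this aggregation; the function name, the three variants and their weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def polygenic_risk_score(dosages, weights):
    """Additive score: sum over variants of (risk allele count * per-variant weight)."""
    dosages = np.asarray(dosages, dtype=float)  # shape (n_individuals, n_variants), values 0, 1 or 2
    weights = np.asarray(weights, dtype=float)  # shape (n_variants,), hypothetical effect sizes
    return dosages @ weights


# Hypothetical example: two individuals genotyped at three variants.
dosages = np.array([[0, 1, 2],
                    [2, 0, 0]])
weights = np.array([0.10, -0.20, 0.30])
scores = polygenic_risk_score(dosages, weights)
print(scores)
```

In practice such scores are usually standardized within a reference population before they are interpreted or combined with other risk factors.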
Metabolomics, the comprehensive profiling of metabolites, has also gained
traction as a promising tool for molecular profiling of disease risk. Similar
to many routine clinical risk factors for chronic diseases, the levels of
metabolites reflect the broad downstream effects of genetic, environmental,
and lifestyle factors 25 , making them attractive candidates for risk prediction. Some individual blood metabolites are already routinely used in
clinical practice; for instance, glucose is a marker for type 2 diabetes, creatinine is used for evaluating kidney function and cholesterol levels are used
to evaluate cardiovascular disease risk. However, detailed metabolomic
profiling holds promise to further improve risk prediction. Research studies have widely established associations of metabolomic biomarkers with
the risk of cardiometabolic diseases 26–28 , and emerging evidence suggests
they may play a broader role in overall human health and disease 29,30 .
To fully establish the potential of metabolomics in risk prediction, it is
essential to incorporate metabolomic profiling data into large prospective
cohort studies, as they provide the extensive longitudinal data required to
identify and validate reliable associations with disease risk. Achieving this
requires mature metabolomics platforms capable of profiling large datasets
with high consistency and reproducibility. Among the available technologies, nuclear magnetic resonance (NMR) spectroscopy has emerged as an
appealing platform due to its ability to reproducibly quantify abundant
circulating metabolites at relatively low cost and high throughput 28,31 .
Given its scalability and reproducibility, NMR is particularly well-suited
for large-scale studies, addressing key requirements for both epidemiological research and eventual clinical translation. Therefore, applying
NMR-based metabolomic profiling in extensive cohort studies like the UK
Biobank presents a unique opportunity to explore the broader applicability
of metabolomics for risk prediction across a wide range of diseases.
In addition to informative data, the accuracy of risk prediction also
depends on the computational methods used to derive the prediction model.
Since disease risks are often time-to-event outcomes, it is essential to
employ appropriate modeling techniques that account for censored data
and varying follow-up times. Established methods in epidemiological
research, such as Cox proportional hazards regression 32 , are widely used
but have limitations in capturing non-linear relationships and interactions
present in complex biological data. Advanced machine learning methods,
such as random survival forests 33 and deep survival models 34,35 , are
better equipped to handle such complexities. However, they often sacrifice
interpretability, which is typically desired for translational applications 36–39 .
Therefore, there is a need for methods that balance the ability to model
complex relationships with the interpretability necessary for translational
risk prediction applications.
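As an illustration of how censoring enters a time-to-event model, the following minimal Python sketch evaluates the negative log partial likelihood of a Cox proportional hazards model. It is an illustrative sketch and not the implementation used in this thesis: the function name, the variable names and the toy data are assumptions made for the example, and tied event times are handled with the simple Breslow convention. Censored individuals contribute no term of their own, but they remain in the risk sets of all events that occur before their censoring time.

```python
import numpy as np


def cox_neg_log_partial_likelihood(beta, X, time, event):
    """Negative log partial likelihood of a Cox model (Breslow handling of ties).

    beta  : (p,) regression coefficients (log hazard ratios)
    X     : (n, p) predictor matrix
    time  : (n,) follow-up time, i.e. time of the event or of censoring
    event : (n,) 1 if the event was observed, 0 if the individual was censored
    """
    beta = np.asarray(beta, dtype=float)
    X = np.asarray(X, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    eta = X @ beta  # linear predictor for each individual

    # at_risk[i, j] is True when individual j is still under follow-up at time[i].
    at_risk = time[None, :] >= time[:, None]

    # log(sum_j exp(eta_j)) over each risk set, computed in a numerically stable way.
    masked_eta = np.where(at_risk, eta[None, :], -np.inf)
    log_risk_sum = np.logaddexp.reduce(masked_eta, axis=1)

    # Only observed events contribute a term; censored rows are dropped here.
    return -np.sum((eta - log_risk_sum)[event])


# Hypothetical toy data: one predictor, three individuals, the second is censored.
X = np.array([[0.5], [1.0], [-0.3]])
time = np.array([1.0, 2.0, 3.0])
event = np.array([1, 0, 1])
print(cox_neg_log_partial_likelihood([0.0], X, time, event))
```

The explicit risk-set matrix costs O(n^2) memory, which keeps the sketch readable but would not scale to biobank-sized cohorts; sorting by follow-up time and using cumulative sums is the usual remedy.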
1.1.2 Treatment strategies in precision medicine
When disease prevention is unattainable, precision medicine aims to optimize treatment by taking into account individual characteristics of the patient. Improved understanding of the molecular underpinnings of diseases
has driven the discovery of new drug targets and enabled the development
of targeted therapies, which selectively interfere with the molecular drivers
of disease progression 40,41 . Targeted therapies have shown remarkable
success particularly in various cancers, where research into the molecular alterations underlying various subtypes of cancer has uncovered key
drivers of tumor growth 42–45 . By targeting these molecular drivers, such
therapies can halt cancer cell proliferation while minimizing damage to
healthy tissue, resulting in fewer side effects and providing a safer alternative to conventional cytotoxic chemotherapies. Consequently, targeted
drugs have become a major therapeutic class in cancer treatment, improving response rates and prolonging survival in patients with corresponding
molecular alterations 42 .
Despite the transformative impact of targeted therapies on cancer treatment, they are not universally effective. A notable challenge to the success
of targeted therapies is the development of drug resistance, which often
arises as cancer cells adapt to the selective pressure imposed by single-target therapies 46–48 . As a result, relying on the modulation of only a single
target often proves insufficient. To overcome this limitation, combination
drug therapies have emerged as a promising strategy and are increasingly
used in treatment of many cancers and other complex diseases 48–50 . In addition to overcoming single-drug resistance, drug combination treatments
can also improve therapeutic efficacy by acting through different molecular
targets or mechanisms, as well as potentially decrease the treatment-related toxicity by lowering the doses of the drugs needed to generate the
same response 42,47,48,51 .
The importance of combination therapies in cancer treatment is evident,
with regulatory bodies such as the U.S. Food and Drug Administration
(FDA) approving 81 new drug combinations for oncology between 2011 and
2021 52 . However, identifying effective drug combinations presents a notable challenge, as the number of potential drug combinations far exceeds
what can be feasibly tested in a clinical setting. Although extensive high-throughput drug combination screens have been conducted for various
cancer types 53–55 , computational approaches, such as machine learning,
will be essential in narrowing down the extensive combinatorial possibilities 56,57 . However, accurately modeling the effects of drug combinations is
challenging due to the complexity of responses across varying doses and
the molecular heterogeneity of cancer. This creates a need for methods
that can accurately predict drug combination effects across different cancer
subtypes and dosing regimens.
1.2 Research aims
The research presented in this dissertation is organized into three main
chapters, each corresponding to specific research aims.
Predictive modelling of drug combination effects
Despite the promise of
machine learning in identifying effective drug combinations, a majority of
the existing methods overlook the fact that the effects of drug combinations
can vary depending on their doses. Most existing approaches simplify the
problem to either a binary or continuous regression problem of synergy,
aiming to determine whether the combined summary effect is greater
than what would be expected from the individual drugs 58,59 . Since a drug
combination that is synergistic at one dose could be antagonistic or additive
at another, there is a need for methods that can predict combination effects
at specific doses. This is also essential for translating the predictions into
clinical practice, as lower doses are often better tolerated by the patients.
Addressing this challenge was the motivation behind the first research
aim:
• Research aim 1: Develop a machine learning framework capable of
predicting drug combination responses at the level of individual dose-responses.
Predictive modelling of disease risks using metabolomic biomarkers
Given
the potential of metabolomics to inform on disease risks, this dissertation
builds upon a novel NMR metabolomics dataset from the UK Biobank,
containing data for over 100,000 individuals. This metabolomics dataset is
the largest of its kind to date and provides a unique opportunity to assess
the relevance of metabolomic biomarkers at scale and across a diverse
array of diseases.
The first study using this dataset was prompted by the global health
concern posed by the coronavirus disease 2019 (COVID-19) pandemic. While
previous research had established older age and chronic health conditions
as risk factors for severe infections 60,61 , the role of molecular factors like
metabolomic and inflammatory biomarkers in susceptibility remained
unclear. This led to the formulation of the second research aim:
• Research aim 2: Evaluate the potential of metabolomic biomarkers to
inform on the susceptibility to severe infectious diseases and COVID-19.
To further understand the broader impact of the metabolomic biomarkers
on health and disease, we sought to systematically analyze the associations
of individual metabolomic biomarkers across all common diseases, extending also beyond those of cardiometabolic origin. This led to the formulation
of the third research aim:
• Research aim 3: Systematically characterize associations between
individual metabolites and disease risk across a broad range of common
diseases.
The strong associations observed across various diseases motivated us
to assess the ability of risk prediction models derived using the metabolomic biomarkers to predict leading chronic diseases with significant
public health relevance. This further raised important questions about the
transferability of such risk prediction models across different population
cohorts. These considerations led to the formulation of the subsequent
research aim:
• Research aim 4: Evaluate the potential of metabolomic risk scores
in predicting the risk of leading chronic diseases and investigate their
transferability across different population cohorts.
Machine learning with comprehensive interaction modelling for disease risk prediction
One challenge in building accurate disease risk prediction
models is accounting for the complex relationships that can influence outcomes in non-linear ways, such as interactions among predictor variables.
Inspired by the results in Publications II-IV, we sought to explore whether the technique
proven effective in modelling interactions in drug combination data under
the first research aim (Publication I) could be adapted to modelling interactions among
predictor variables for time-to-event outcomes, such as disease risks. This
culminated in the formulation of the final research aim:
• Research aim 5: Develop a machine learning method to account for
comprehensive interaction effects among predictor variables in modelling
time-to-event outcomes.
1.3 Summary of research contributions
The main contributions of this dissertation in terms of key results and
impact are given in the original Publications I-V and summarised below.
• In Publication I, we developed comboFM, a machine learning framework to predict dose-specific drug combination responses. While earlier
methods had primarily addressed direct prediction of drug combination
synergy 58,59 , comboFM enables detailed dose-specific response predictions with the ability to subsequently evaluate synergies at different
dose combinations. comboFM learns from experimental drug combination response data by using higher-order factorization machines 62 ,
a recently proposed machine learning technique that can account for
comprehensive interactions in large datasets. Our computational experiments using data from a cancer cell line pharmacogenomic screen
demonstrated that comboFM obtains high prediction accuracy in various practical prediction scenarios. The practical utility of comboFM
was further confirmed through experimental validation of previously
untested drug combinations, demonstrating its potential to advance the
development of combination therapies.
• In Publication II, we explored whether metabolomic biomarkers measured by NMR could predict susceptibility to severe pneumonia and
COVID-19. Our findings revealed new molecular insights into the susceptibility to severe infections. In addition to inflammatory biomarkers,
many blood biomarkers previously associated mainly with cardiometabolic
diseases, such as amino acids and fatty acids, were also predictive of hospitalization for severe infections years later. We derived a metabolomic
risk score, a risk prediction model based on metabolomic biomarkers, and
showed its strong association with increased risk of both severe pneumonia and severe COVID-19, even up to a decade after the initial blood
samples were collected. These findings highlight the potential of NMR
metabolomic profiling to complement existing risk identification tools
and improve understanding of the molecular mechanisms underlying
susceptibility to severe infections.
• In Publication III, we systematically characterized the associations of
NMR metabolomic biomarkers across the incidence, prevalence and mortality of hundreds of common diseases in the UK Biobank. Our findings revealed a broad range of novel biomarker associations, including
risks of various cancers, mental health outcomes, and musculoskeletal
disorders. While earlier research had primarily linked these biomarkers to cardiometabolic conditions 26–28 , this study demonstrates their
broader relevance across a wide range of diseases and in a uniquely
large population-scale setting. Additionally, we identified both similarities and differences in biomarker association patterns across various
diseases, suggesting underlying systemic connections as well as distinct
molecular signatures unique to certain conditions. Furthermore, the
associations were shown to widely replicate in THL Biobank. Our results
highlight the potential of these biomarkers to provide valuable insights
into multi-disease risk.
• In Publication IV, motivated by the results from Publications II-III, we
derived metabolomic risk scores to predict the risk of 12 leading causes
of morbidity. This study leveraged a uniquely large population sample
from three major national biobanks, UK Biobank, Estonian Biobank, and
Finnish THL Biobank, totaling over 700,000 participants. This unique
setup enabled the validation of metabolomic risk scores across multiple
population cohorts, demonstrating consistent cross-biobank replication.
This study was also the first to systematically compare the predictive
performance of metabolomic risk scores and polygenic risk scores across
multiple diseases within such a large population. We showed that metabolomic risk scores exhibited stronger associations with future disease
risk than polygenic scores for the majority of diseases studied. These
results demonstrate the potential of metabolomic biomarkers to inform
disease risk prediction.
• In Publication V, we introduced survivalFM, a novel machine learning method for time-to-event risk prediction. This method effectively
accounts for comprehensive interactions among predictor variables. It utilizes a low-rank factorized parametrization approach, a concept adapted
from factorization machines applied for drug combination prediction
in Publication I. This approach facilitates simultaneous estimation of
comprehensive interaction effects, even with numerous predictor variables, for predicting time-to-event outcomes, such as disease risks. In
contrast to many other advanced machine learning approaches in survival analysis 33–35 , survivalFM produces interpretable models, which
is often important for translational applications. Using data from the
UK Biobank, we demonstrated that survivalFM improves risk prediction
performance across various data sources and disease outcomes, including a practical cardiovascular risk prediction scenario. These findings
highlight the potential of survivalFM to provide more nuanced insights
into disease risk and to enhance the accuracy of risk predictions.
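To make the idea of a low-rank factorized parametrization more concrete, the following minimal Python sketch scores a single feature vector with a second-order factorization machine. It is an illustration under stated assumptions and not the comboFM or survivalFM implementation: comboFM uses higher-order factorization machines, survivalFM places a factorized interaction term inside a time-to-event model, and all names and numbers here (fm_predict, fm_predict_naive, w0, w, V and the toy values) are hypothetical.

```python
def fm_predict(x, w0, w, V):
    """Second-order factorization machine score for one feature vector x.

    V[i] is the rank-k latent vector of feature i; the interaction weight
    between features i and j is the dot product of V[i] and V[j].
    """
    linear = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    interactions = 0.0
    for f in range(len(V[0])):
        total = sum(V[i][f] * x[i] for i in range(len(x)))
        total_of_squares = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        # Identity: sum_{i<j} a_i a_j = 0.5 * ((sum a_i)^2 - sum a_i^2)
        interactions += 0.5 * (total * total - total_of_squares)
    return linear + interactions


def fm_predict_naive(x, w0, w, V):
    """Same score with every pairwise interaction written out explicitly."""
    score = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            weight_ij = sum(a * b for a, b in zip(V[i], V[j]))
            score += weight_ij * x[i] * x[j]
    return score


# Hypothetical toy values: 3 features, rank-2 latent vectors.
x_demo = [1.0, 0.0, 2.0]
w0_demo = 0.1
w_demo = [0.5, -0.2, 0.3]
V_demo = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]

fast_score = fm_predict(x_demo, w0_demo, w_demo, V_demo)
naive_score = fm_predict_naive(x_demo, w0_demo, w_demo, V_demo)
print(fast_score, naive_score)
```

The factorized form stores one rank-k vector per feature instead of one weight per feature pair, which is what keeps the estimation of all pairwise interactions feasible when the number of predictors is large.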
1.4 Outline
This dissertation is structured as follows. Chapter 2 provides the biological
and computational background related to the various aspects of precision
medicine addressed in this dissertation. It establishes the current state
of the field and justifies the research aims outlined in the introduction.
Chapters 3, 4, and 5 present the research contributions from Publications
I-V, organized according to the main themes: i) predictive modelling of
drug combination effects, ii) predictive modelling of disease risks using
metabolomic biomarkers and iii) extending risk prediction methodology
by accounting for comprehensive interaction effects. Finally, Chapter 6
provides conclusions and discusses potential future research directions.
2. Background
2.1 Molecular profiling in precision medicine
Broad molecular profiling, commonly known as ’omics, refers to a set
of high-throughput experimental technologies and related scientific fields
dedicated to a large-scale, quantitative analysis of molecular data, particularly as they relate to human health and disease. The suffix -ome was
initially used in the context of genome studies, referring to the complete
set of genetic material within an organism. Over time, -ome has come
to denote the totality of various biological entities, and correspondingly,
’omics has become a general term for the study of comprehensive biological
datasets. The concept of the ’omics cascade illustrates the hierarchical flow
of biological information, from DNA (genome) to RNA (transcriptome),
proteins (proteome), and metabolites (metabolome), each layer offering
different insights into biological processes (Figure 2.1).
’Omics technologies have provided precision medicine with the tools
to operate at a detailed molecular level. These technologies, especially
in the field of genomics, have already driven translational research and
are gradually beginning to contribute to clinical practice. For instance,
genome-wide association studies (GWAS) have provided insights into the
genetic basis of various complex diseases 63,64 . These findings have not only
enriched our understanding of the genetic architecture of disease but also
proven valuable for therapeutic development, as targets backed by genetic
evidence are more likely to succeed in clinical trials 65,66 . Furthermore,
GWAS findings are now being utilized in other applications, such as in
genetic risk prediction using polygenic risk scores, which may soon be
incorporated as stratification tools in clinical settings 23,67–69 .
As ’omics technologies become more affordable and scalable, including
those that measure biological entities closer to the phenotype (e.g., the
metabolome and proteome), they enable a more comprehensive view of
the molecular effects related to disease outcomes. Such developments will
advance precision medicine by enabling the identification of previously
undetected biomarkers, refining disease subtyping, improving the accuracy
of risk prediction models, and informing the development of more effective
therapeutic strategies 8,13–15 .
[Figure 2.1: Genomics (DNA) → Transcriptomics (RNA) → Proteomics (proteins) → Metabolomics (metabolites) → Phenotype (disease), with the steps labelled transcription, translation and protein activity.]
Figure 2.1. ’Omics cascade illustrating the flow of biological information from
genomics through transcriptomics and proteomics to metabolomics.
2.2 Disease risk prediction in precision medicine
Disease risk prediction refers to the concept of estimating the likelihood
that an individual will develop a particular disease or condition within
a specified time frame, based on the analysis of one or more risk factors.
These factors can include, for instance, environmental, behavioral, and
clinical attributes, or molecular biomarkers measured through traditional
laboratory chemistry or emerging ’omics technologies. Prediction can be
done based on a single risk factor, such as a biomarker that directly correlates with disease risk, or it can involve complex statistical or machine
learning models that integrate multiple risk factors to provide an estimation of disease risk. The ultimate goal of disease risk prediction is to
enable precision medicine by stratifying individuals into different levels of
risk. Through early identification of individuals at high risk, targeted prevention
strategies and interventions can be implemented, potentially altering the
disease trajectory and improving health outcomes.
In this dissertation, the main focus is placed on predicting the first onset
of disease in currently undiagnosed individuals. However, risk prediction
can also be applied in the context of predicting disease progression or
recurrence.
2.2.1 Fundamentals of time-to-event modelling
Time-to-event analysis, also known as survival analysis, is a branch of
statistics and machine learning which deals with estimating the time
until the occurrence of an event, such as disease onset, based on given
characteristics of an individual.
Time-to-event data and censoring
Survival analysis involves time-to-event
data, which measures the duration from the start of a follow-up period to
the occurrence of a specified event, such as the onset of a disease. A central
feature of such data is the issue of censoring, which arises when the precise
timing of the event remains unknown. This can occur if the study ends
before the event occurs, or if an individual is lost to follow-up. The most
frequently encountered type of censoring is right-censoring, which occurs
when the event has not taken place by the end of the follow-up period.
Other forms include left-censoring, where the event takes place before the
individual is enrolled in the study, and interval-censoring, where the event
is known to have occurred within a specified time interval, though the
precise time remains unknown.
This dissertation focuses on the analysis of right-censored data (Figure
2.2). In such cases, the outcome is represented by two variables: the
event of interest, such as disease onset, and the time from the start of the
follow-up period until either the event occurs or the individual is censored.
Formally, the time-to-event dataset $D$ is defined as a collection of tuples
$D = \{(x_i, T_i, \delta_i)\}_{i=1}^{N}$, where $x_i$ denotes a vector of predictors characterizing
individual $i$, $T_i$ represents the observed time until the occurrence of the
event or censoring, and $\delta_i$ is an indicator variable that takes the value of 1
if $T_i$ corresponds to an observed event and 0 if it is a censored observation.
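As a small illustration of how right-censored tuples of predictors, observed time and event indicator can arise, the sketch below builds them from latent event times and censoring times. It is only a toy construction under stated assumptions: in real cohort data the latent event time of a censored individual is never observed, and the function name, the predictor columns (age, smoking status) and all numbers are hypothetical.

```python
def make_right_censored(predictors, event_times, censor_times):
    """Build (x_i, T_i, delta_i) tuples from latent event and censoring times."""
    dataset = []
    for x, event_time, censor_time in zip(predictors, event_times, censor_times):
        observed_time = min(event_time, censor_time)        # T_i
        event_observed = 1 if event_time <= censor_time else 0  # delta_i
        dataset.append((x, observed_time, event_observed))
    return dataset


# Hypothetical toy cohort: [age, smoker] predictors, times in years.
predictors = [[54, 1], [61, 0], [47, 0], [39, 1]]
latent_event_times = [2.5, 12.0, 7.1, 30.0]   # unknown in practice if censored
censoring_times = [10.0, 10.0, 4.0, 10.0]     # end of follow-up or loss to follow-up

dataset = make_right_censored(predictors, latent_event_times, censoring_times)
for x, observed_time, event_observed in dataset:
    print(x, observed_time, event_observed)
```

In this toy cohort only the first individual has an observed event; the second and fourth are censored at the end of follow-up, and the third is lost to follow-up after four years.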
Survival and hazard functions
In survival analysis, two central functions
are used to model time-to-event data: the survival function and the hazard
function. The survival function, denoted as $S(t)$, gives the probability
that an individual will survive beyond a specified time point $t$. It relates to
the cumulative distribution function (CDF) of the event times, $F(t)$, which
quantifies the cumulative risk of the event occurring by time $t$:
\[
S(t) = P(T > t) = 1 - F(t) \tag{2.1}
\]
where $0 \le S(t) \le 1$, and $F(t) = \int_0^t p(z)\,dz$, with $p(t)$ representing the probability density function (PDF) that characterizes the likelihood of the event
occurring at a specific time $t$.
[Figure 2.2: timeline schematic. Individuals are followed from baseline ($t = 0$) until an event ($\delta = 1$) or a censored observation ($\delta = 0$) at times $T_1, T_2, \ldots, T_N$, up to the end of the follow-up period. Predictors $x_i$ enter a risk model $g(x_i)$, which gives the estimated risk $h(t \mid x_i)$, $S(t \mid x_i)$.]
Figure 2.2. Right-censored survival data in time-to-event modelling. Individuals
are observed over a follow-up period, during which the occurrence
of the event of interest is recorded. Right-censoring arises when
the event has not happened by the end of the follow-up period or
if individuals are lost to follow-up. Predictor variables, $x_i$, for an
individual $i$ are measured at baseline and are used to estimate the
individual’s risk using a risk model $g(x_i)$. This risk estimate can be
expressed through functions such as the hazard function $h(t \mid x_i)$ or
the survival function $S(t \mid x_i)$, which quantify the likelihood of the event
occurring over time.
The hazard function, h(t), provides a complementary view by describing
the instantaneous risk of the event occurring at time t, conditional on
survival up to that point:
\[
h(t) = \frac{p(t)}{S(t)} = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} \tag{2.2}
\]
Together, the survival and hazard functions provide complementary insights into the timing and likelihood of events. While the survival function
typically decreases over time as the probability of survival diminishes, the
hazard function may vary depending on how the risk evolves over time.
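The relationships between the survival function, the cumulative distribution function, the density and the hazard can be checked numerically. The sketch below does this for a Weibull event-time distribution, chosen purely for illustration; the distribution, its shape and scale values and the function names are assumptions and do not come from the dissertation.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # hypothetical Weibull parameters (illustrative only)


def survival(t):
    """S(t) = P(T > t) for a Weibull event-time distribution."""
    return math.exp(-((t / SCALE) ** SHAPE))


def cdf(t):
    """F(t) = 1 - S(t), the cumulative risk of the event by time t."""
    return 1.0 - survival(t)


def density(t):
    """p(t), the derivative of F(t) for the Weibull distribution."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1) * survival(t)


def hazard(t):
    """h(t) = p(t) / S(t)."""
    return density(t) / survival(t)


def hazard_from_limit(t, dt=1e-6):
    """Finite-difference approximation of the limit definition of h(t)."""
    # P(t <= T < t + dt | T >= t) = (F(t + dt) - F(t)) / S(t)
    conditional_probability = (cdf(t + dt) - cdf(t)) / survival(t)
    return conditional_probability / dt


t_demo = 5.0
print(hazard(t_demo), hazard_from_limit(t_demo))
```

With a shape parameter above 1 the hazard increases over time, while the survival function always decreases, which mirrors the qualitative behaviour described in the text.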
Data sources and predictor variables
Time-to-event risk prediction models
are typically developed using data from epidemiological cohort studies,
which collect baseline measurements from participants at the beginning of
the study ($t = 0$), capturing a comprehensive range of predictor variables $x$
(Figure 2.2). These variables typically include sociodemographic factors,
lifestyle behaviors, and family medical history, obtained through questionnaires, along with physiological and clinical measurements, such as blood
pressure and body mass index. Additionally, biological samples, such as
blood, are often collected and stored for subsequent laboratory analyses,
including various ’omics measurements.
Participants are then followed longitudinally, with both the occurrence
and timing of relevant events recorded, generating time-to-event data for
risk modeling. In modern cohort studies, follow-up data is often linked
from electronic health records, providing continuous and detailed tracking of health outcomes over time. The duration of follow-up for disease
risk modeling typically ranges from 5 to 10 years, though this can vary
depending on the objectives and design of the study.
Disease events captured in cohort studies can be categorized into prevalent events and incident events. Prevalent events refer to occurrences that
have taken place before the participant enrolled in the study, whereas
incident events are those that arise during the follow-up period. Risk
prediction models are primarily concerned with incident events, as they
represent new cases of disease that develop after the baseline assessment.
Prevalent events are typically excluded from risk prediction analyses, as
individuals with existing disease have already experienced the outcome,
complicating baseline risk assessment and confounding the analysis.
Methods in time-to-event risk modelling
The objective of time-to-event
risk prediction is to accurately predict survival, cumulative risk, or hazard
functions for a given individual, at any time of interest $t$ based on a set of
predictor variables (Figure 2.2). Typically, such models are expressed in
terms of the hazard function $h(t \mid x)$. However, since the survival, cumulative
risk, probability density and hazard functions are mathematically related,
knowing one allows derivation of the others. One of the most widely used
methods for modelling time-to-event data is the Cox proportional hazards
regression model 32 , which will be covered in detail in Section 2.5.1.
Despite the widespread use of Cox proportional hazards regression,
it has limitations, particularly its assumption of linear relationships
between predictor variables and the log-hazard function. To address these
limitations, various alternative survival analysis methods have been developed. Advanced machine learning techniques, such as random survival
forests 33 and deep survival models 34,35 , are capable of capturing complex
interactions and non-linear relationships within the data. While these
methods can potentially improve predictive performance, they also come
with trade-offs, such as increased computational complexity and the need
for extensive hyperparameter tuning. More importantly, these models
sacrifice interpretability, which is often needed in clinical decision making where understanding the risk factors underlying the prediction is
essential 36–39 . Furthermore, some studies suggest that the performance
improvements offered by these advanced methods are modest and require large sample sizes to be effective 70,71 . Therefore, there is a need for
methods that balance the ability to model complex relationships with the
interpretability needed for translational applications.
Evaluation of risk prediction models
The predictive performance of risk
prediction models generally depends on the outcome being modeled, the
population to which the model is applied, the amount of information the
predictors provide about the outcome, and the suitability of the modelling
methodology. Once a prediction model is developed, its performance must
be evaluated using different metrics. Some of the most common metrics
used for this purpose are discrimination, calibration, and reclassification.
Discrimination measures how well the model distinguishes between individuals who will experience the outcome and those who will not, typically
assessed by the area under the ROC curve (AUC) or the concordance index
(C-index) 72 . Calibration evaluates how closely the predicted probabilities
align with the actual outcomes. A well-calibrated model should have predictions that match the observed risk of events. Reclassification assesses
whether a new model improves risk classification. For instance, the net
reclassification index (NRI) quantifies how well individuals are reclassified
into more appropriate risk categories, based on whether they eventually
experience the event or not 73 .
External validation is crucial for ensuring that the model generalizes to
different populations or settings. This involves applying the model to new
datasets to confirm that its performance remains consistent. Guidelines
like TRIPOD (Transparent Reporting of a Multivariable Prediction Model
for Individual Prognosis or Diagnosis) 74 provide best practices for the
development, validation, and transparent reporting of prediction models.
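Discrimination for right-censored data can be illustrated with a simple pairwise concordance computation. The sketch below follows a common formulation of the C-index in which a pair is comparable only when the individual with the shorter observed time had an observed event. It is a minimal quadratic-time illustration, not the evaluation code used in the publications, and the function name and toy values are hypothetical; established survival analysis libraries handle ties and large samples more carefully.

```python
def concordance_index(times, events, risk_scores):
    """Pairwise C-index for right-censored data (simple O(n^2) version).

    A pair (i, j) is comparable if individual i had an observed event
    strictly before the observed time of individual j. The pair is
    concordant if i also has the higher predicted risk; ties in the
    predicted risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a censored individual cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: observed times, event indicators, predicted risks.
toy_times = [2.0, 4.0, 6.0, 8.0]
toy_events = [1, 1, 0, 1]
toy_risks = [0.8, 0.3, 0.5, 0.1]

c_index = concordance_index(toy_times, toy_events, toy_risks)
print(c_index)
```

In the toy data five pairs are comparable and four of them are ordered correctly by the predicted risk, giving a C-index of 0.8; a value of 0.5 would correspond to random ordering and 1.0 to perfect ordering.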
2.2.2 Risk prediction in preventive healthcare
Advances in disease risk prediction have been shaped by large, longitudinal studies that have identified key risk factors and developed disease
risk prediction models. For instance, the Framingham Heart Study, initiated in 1948, was instrumental in establishing several risk factors for
cardiovascular disease (CVD) in the general population, such as smoking,
obesity, high cholesterol, and hypertension 75–77 . These findings, along
with contributions from other seminal studies 78,79 , provided the basis for
stratifying CVD prevention and treatment strategies by emphasizing the
role of individual risk factors. The Framingham Risk Score, developed as a
result of this work, remains widely used to estimate an individual’s CVD
risk and to guide interventions such as lipid-lowering therapies 77,80,81 .
Similar models have since been developed for other chronic diseases, such
as the Gail Model for breast cancer risk prediction 82 , the QDiabetes risk score
for predicting type 2 diabetes risk 83 , and the QCancer tool for assessing the
risks of common cancers 84 .
Cardiovascular medicine has been at the forefront of integrating risk
factor analysis and predictive models into routine clinical practice. This is
exemplified by the widespread development of CVD risk prediction models
globally. In addition to the Framingham Risk Score, notable examples
include the SCORE model in Europe 22 , the QRISK model in the UK 20 ,
FINRISK in Finland 85 , the ACC/AHA-ASCVD pooled cohort equations
in the US 86 , and the China-PAR model 87 . These CVD risk prediction
models have become an integral part of local clinical guidelines, facilitating
personalized prevention and treatment strategies, such as lifestyle changes
and lipid-lowering treatments 88,89 . However, while risk prediction has
been successfully integrated into preventative cardiovascular medicine
and serves as an example of precision medicine, this level of application
has largely not extended to other disease areas.
As the global burden of avoidable chronic health conditions is rising, the
need for effective prevention strategies has become increasingly important 90,91 . In response, healthcare systems have introduced preventative
health programs. For instance, the UK National Health Service (NHS)
offers a Health Check every five years to identify individuals at high risk
for conditions such as heart disease, kidney disease, type 2 diabetes, and
stroke, using a set of standard clinical risk factors 92,93 . However, there
is growing recognition that integrating ’omics data could enhance the accuracy of risk prediction and broaden its application to a wider range of
diseases. Due to the increased scalability and reduced costs brought by
technological advancements, these ’omics platforms are becoming more
practical and timely for wider use in healthcare systems 11,12 .
2.2.3 Emerging trends in risk prediction
The increasing availability of comprehensive datasets is driving advancements in disease risk prediction. Large-scale initiatives like the UK
Biobank 94 , FinnGen 95 , All of Us 96 , and Our Future Health 4 are providing vast resources that integrate electronic health records with genetic,
clinical, lifestyle, and multi-omics data from hundreds of thousands of participants. These datasets are valuable not only for investigating disease
risk factors at the population level but also for the development and validation of prediction models with translational potential. For instance, recent
studies have highlighted the value of utilizing electronic health records
from such biobanks to predict disease risk, offering another dimension of
health system-derived data for improving risk prediction 97–99 .
Among the emerging approaches in risk prediction, polygenic risk scores
(PRS) have gained considerable attention. A PRS is a numerical value that
quantifies an individual’s genetic susceptibility to a specific trait or disease based on the effects of multiple genetic variants identified through
genome-wide association studies (GWAS). In recent years, PRSs have been
developed and tested for a range of diseases, including coronary artery
disease 23,100 , stroke 101 , chronic kidney disease 102 , type 2 diabetes 23,103
and common cancers 104–107 . While PRSs are able to stratify population risk, especially
for individuals in the highest risk percentiles, their predictive
performance varies across diseases and populations. PRSs tend to perform
better in populations of European descent, where most GWAS data have
been generated, raising concerns about generalizability and equity in more
diverse populations 69,108 .
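In its simplest additive form, a polygenic risk score can be written as a weighted sum of an individual's risk-allele counts, with weights taken from GWAS effect size estimates. The sketch below illustrates only this basic form; it is not the scoring method used in any study cited here, published PRS methods typically also adjust the weights for linkage disequilibrium, and the function name and all numbers are hypothetical.

```python
def polygenic_risk_score(allele_dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant)."""
    if len(allele_dosages) != len(effect_sizes):
        raise ValueError("one effect size is required per variant")
    return sum(beta * dosage for beta, dosage in zip(effect_sizes, allele_dosages))


# Hypothetical per-variant effect sizes (e.g. log odds ratios) and one genotype.
toy_effect_sizes = [0.12, -0.05, 0.30]
toy_dosages = [2, 1, 0]

prs = polygenic_risk_score(toy_dosages, toy_effect_sizes)
print(prs)
```

Because the weights are estimated in a particular GWAS population, the same weighted sum can rank individuals less accurately in populations with different allele frequencies and linkage patterns, which is one way to read the generalizability concern raised above.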
While polygenic risk scores have gained notable attention, other ’omics
data modalities also hold potential for disease risk prediction. For instance,
recent large-scale studies using plasma proteomics data from the UK
Biobank have demonstrated that proteomic risk scores can effectively
stratify risk for various common diseases and potentially improve the
prediction accuracy compared to traditional risk factors 109–111 . Another
promising area for risk prediction is metabolomics, which will be explored
for risk prediction in Publications II-IV. The following section will delve
into metabolomics, providing the necessary background for understanding
its role in disease risk prediction.
2.3 Metabolomics for predicting disease risks
2.3.1 Human metabolome
Analogous to the genome, transcriptome and proteome, the metabolome refers to
the complete set of metabolites found in an organism. Metabolites reflect
the downstream effects of genetic, transcriptomic, and proteomic processes as well as
environmental factors, providing a detailed snapshot of the organism’s biochemical state close to the phenotype 112 . Current analytical platforms can
detect hundreds to thousands of metabolites depending on the technology
used; however, no single platform can capture the entire metabolome.
Metabolites encompass a wide range of molecules, serving as reactants,
intermediates, or end products of enzyme-mediated biochemical reactions.
The sources of metabolites can be divided into endogenous and exogenous
origins. Endogenous metabolites emerge from internal biochemical reactions, while exogenous metabolites are introduced into the system from
external vectors such as dietary intake or pharmacological treatments.
Consequently, metabolite levels are dynamic and reflect variations influenced not only by intrinsic factors such as genetics and disease states, but
also by extrinsic factors such as diet and medications 113–115 .
In terms of their chemical composition, metabolites exhibit stark differences when compared to genes, transcripts, and proteins. Unlike the
chemical homogeneity observed in nucleotides forming genes and transcripts, or amino acids constituting proteins, metabolites span an extensive
range of chemical classes. These can range from small hydrophilic sugars to large hydrophobic lipids. Additionally, the concentration ranges of
metabolites can vary widely, with abundant metabolites being measurable
in the millimolar range, while less abundant ones are only detectable in
the picomolar range 116,117 .
Various sample types, including blood, urine, cerebrospinal fluid and
saliva, can be used to study the human metabolome 118 . However, this
dissertation specifically focuses on blood metabolomics. Blood serves as the
primary transport medium for nutrients, metabolites, hormones, and waste
products throughout the body, offering a comprehensive reflection of the
metabolic activities occurring in different organs and tissues. This holistic
representation makes blood an ideal sample for metabolomics profiling,
particularly in the study of systemic diseases. Moreover, blood is easily
accessible through minimally invasive sampling techniques, making it a
practical and convenient option for both clinical and research settings.
2.3.2 High-throughput profiling of metabolites
Metabolomics requires technologies capable of measuring a wide range
of metabolites simultaneously. The two most commonly used methods
for this purpose are nuclear magnetic resonance (NMR) spectroscopy and
mass spectrometry (MS). Both techniques generate spectra, in which peak
positions and intensities correspond to specific metabolites. However, NMR
and MS operate on different principles, with each offering distinct advantages and limitations, making them complementary tools in metabolomics
research 119 .
NMR detects and quantifies metabolites based on the magnetic properties of
protons. It provides high reproducibility, requires minimal sample preparation, and preserves sample integrity. NMR is inherently quantitative,
allowing metabolite concentrations to be measured directly from signal
intensity without external calibration. However, its sensitivity is low, making it more suitable for detecting metabolites present in medium to high
concentrations 120 . In contrast, MS identifies metabolites based on their
mass-to-charge ratio after ionization. It offers much higher sensitivity and
can also detect low-abundance metabolites. However, MS has difficulty
distinguishing between metabolites with the same mass, and incurs higher
costs due to the need for reference substances. Moreover, MS often faces
issues with instrument consistency and signal drift, leading to lower repeatability 121,122 . Such variability can pose challenges in applications like
disease risk prediction, where detecting subtle differences in metabolite
levels among healthy individuals is important for accurately distinguishing
risk profiles. Therefore, while MS can measure the broadest range of metabolites,
NMR is appealing in terms of reproducible quantification of abundant
circulating metabolites at scale and at a relatively low cost.
Publications II-IV of this dissertation utilize quantitative metabolomics
data generated by the Nightingale Health NMR platform 28,31 . This platform
simultaneously quantifies 168 metabolomic biomarkers and 81 ratios
of these, including lipoprotein lipids, fatty acids, ketone bodies, amino
acids, and other low-molecular-weight metabolic biomarkers, as well as
lipoprotein subclass composition.
2.3.3 Prior research in metabolomics and disease risk
Advances in metabolomic profiling technologies over the past two decades
have led to a substantial increase in metabolomics research, expanding
our understanding of the metabolomic complexity underlying health and
disease. Due to the vast number of studies in this field, this section focuses
on examples of human studies and findings related to blood metabolomics
and incident disease risk. Metabolomics research also encompasses a
variety of other study types, including investigations into dietary impacts
on metabolism 114,123 , elucidation of the genetic determinants of metabolite
levels 124–127 , and assessments of causal relationships through Mendelian
randomization studies 128–130 , but such studies are beyond the scope
of this motivating review.
Initial applications of metabolomic profiling were seen in small cohorts,
typically involving a few hundred to a few thousand participants, often in
case-control settings, and utilized both MS and NMR-based assays. Despite
limited sample sizes, these first prospective cohort studies established key
associations, such as linking metabolites like branched-chain and aromatic
amino acids to the risk of type 2 diabetes 131–133 . Similarly, certain fatty
acids and lipid biomarkers were associated with cardiovascular events,
although findings were less consistent 134–137 . These initial investigations
revealed the role of metabolic alterations in cardiometabolic diseases and
highlighted the need for larger studies to enhance generalizability and
statistical power.
Since 2015, advances in metabolomic profiling platforms, particularly
those based on NMR, have enabled metabolomic profiling in large epidemiological cohort studies 28,31 . For example, Ahola-Olli et al. confirmed
the association of branched-chain and aromatic amino acids with type
2 diabetes risk in a cohort of 11,896 young adults. They also identified
additional biomarkers, including various fatty acids and lipoprotein lipids,
linked to hyperglycemia and diabetes onset 138 . Similarly, Borges et al.
demonstrated the role of certain fatty acid biomarkers in cardiovascular
risk through a meta-analysis of 16,126 participants from six cohorts, revealing distinct association patterns for stroke and coronary heart disease 139 .
Other studies have also identified unique metabolomic signatures for different subtypes of cardiovascular disease, including stroke, coronary artery
disease, and peripheral artery disease 140,141 . These findings highlight the
complexity of the metabolomic contributions to these conditions.
While most research has focused on cardiometabolic diseases, individual
metabolites have also been linked to other conditions. For instance, Fischer
et al. found four metabolomic biomarkers to be associated with all-cause
mortality risk in two cohorts totaling 17,345 individuals 142 . Similarly,
Ritchie et al. associated the biomarker glycoprotein acetyls (GlycA) with
chronic inflammation and the long-term risk of severe infection in three
population-based cohorts involving 11,825 participants 143 . Kettunen et al.
tested the associations of GlycA with the incidence of over 400 common
diseases in a cohort of 11,861 individuals, revealing a broad spectrum
of significant associations with various disease outcomes 144 . Tynkkynen
et al. and van der Lee et al. identified specific metabolites linked to
the risk of dementia, Alzheimer’s disease and cognitive decline 145,146 .
Additionally, individual metabolites have been associated with the risk of
various cancers, including breast 147 , colorectal 148 , and pancreatic 149,150
cancers.
As sample sizes grew, the focus shifted from associating individual
biomarkers with specific disease outcomes to appreciating the whole metabolomic profile as a broader reflection of disease risk. For instance, Deelen
et al. developed a prediction model for all-cause mortality using 14 metabolites identified from metabolomic profiles in a study involving 44,168 individuals from 12 cohorts 30 . This model demonstrated improved prediction
accuracy compared to standard risk factors for mortality, suggesting that
metabolomic profiles can reflect broader frailty and systemic risks associated with mortality. Similarly, Pietzner et al. examined multimorbidity,
the simultaneous presence of multiple chronic conditions, using MS-based
metabolomics profiles from 11,966 individuals. Their findings revealed
shared metabolite associations among multiple incident diseases, highlighting the systemic relevance of blood metabolites 29 .
While the potential of metabolomic biomarker data to inform disease
risks has become evident, the challenge remains in translating these
findings into practical tools for risk prediction. A key step forward will
be the expansion of metabolomics data within large population cohorts
that are linked to comprehensive health outcome data. Prior to the research presented in this dissertation, the largest risk prediction studies
in metabolomics had involved around 10,000 participants, with multi-cohort meta-analyses extending this to approximately 40,000. Publications
II-IV of this dissertation introduce and build upon a novel NMR metabolomics dataset from the UK Biobank, comprising metabolomic profiles
from over 100,000 individuals. This substantial increase in sample size provides a unique opportunity to assess the broader relevance of metabolomic
biomarkers in health and disease and to develop robust risk prediction
models.
Following the release of the UK Biobank metabolomics dataset and the
presentation of initial findings in Publications II-III, several subsequent
studies have leveraged this resource to further explore the role of metabolomics in disease risk prediction. For instance, Buergel et al. 151 trained
deep learning models to predict common multi-disease outcomes based on
the metabolomic profiles from UK Biobank. Their models stratified risk
across a wide range of conditions and improved risk prediction beyond
established clinical risk factors in predicting diseases such as type 2 diabetes, dementia, and heart failure. Other prediction studies have focused
on specific disease outcomes, demonstrating the role of metabolomics in
the risk of type 2 diabetes 152 , dementia 153,154 , heart failure 155 , hepatocellular carcinoma and chronic liver disease 156 , Parkinson’s disease 157 and
aging 158 , among others. Collectively, these studies highlight the growing
research interest in using metabolomics for risk prediction, with promising
implications for precision medicine.
2.4 Treatment strategies in precision medicine
Beyond disease risk prediction, another central aspect of precision medicine
relates to optimizing treatments by considering the unique variability
among patients.
2.4.1 Drugs and drug targets
Drugs are chemical entities designed to alter biological processes within
the body, primarily with the goal of treating, curing, or preventing diseases.
The effectiveness of a drug is fundamentally linked to its ability to interact
with specific molecules in the body, drug targets. These targets are typically
proteins, such as enzymes, receptors, or ion channels, that are intrinsically
linked to the pathophysiology of diseases. For instance, statins, a widely
utilized class of cholesterol-lowering agents, exert their effects by inhibiting
HMG-CoA reductase, an enzyme crucial to cholesterol synthesis in the
liver 159 .
The specificity with which a drug engages its target is often considered
crucial for its therapeutic efficacy and safety, as drugs that precisely interact with their intended targets can theoretically produce strong therapeutic
effects while minimizing side effects 160 . However, the inherent complexity
of human biology and disease frequently challenges this ideal. In practice,
drugs with broad, multi-target effects can sometimes be more effective or
safer than highly selective agents 161–163 . Consequently, multi-targeted
monotherapies and drug combinations are increasingly recognized as valuable strategies for managing complex diseases, such as cancers 49,162 .
2.4.2 Drug combination treatments
Drug combination treatments involve the simultaneous use of two or more
therapeutic agents to target multiple pathways or mechanisms associated
with a disease. This approach has become increasingly important in
managing complex conditions like cancer and infectious diseases, where
single-agent therapies often fall short due to the complex pathophysiology
and adaptive nature of these diseases 49,164–166 .
Drug combinations can enhance therapeutic efficacy through diverse
mechanisms. For instance, in cancer treatment, combining cytotoxic agents
with targeted therapies can improve outcomes by simultaneously inducing
direct tumor cell death while inhibiting specific signaling pathways that
contribute to cell proliferation. A classic example is the combination of
chemotherapeutic agents, such as cisplatin or paclitaxel, with targeted
therapies like bevacizumab, an anti-angiogenic drug. This combination
approach inhibits cell division while concurrently interfering with the
vascular supply of the tumour, resulting in improved survival rates for patients with specific cancer types, such as non-small cell lung cancer 167,168 .
Moreover, drug combinations can help mitigate the emergence of drug resistance, a major challenge in cancer treatment 46–48 . Resistance frequently
develops as cancer cells adapt to the selective pressure of single-agent therapies by activating alternative pathways to sustain growth. Combination
therapies, which simultaneously target multiple pathways or mechanisms,
can disrupt various aspects of cancer signaling networks, complicating
the tumor cells’ ability to adapt and evade treatment. For instance, in
melanoma patients with BRAF V600E mutations, combining MEK and
BRAF inhibitors has been shown to reduce resistance and improve overall
treatment efficacy compared to targeting either pathway alone 169 .
An additional benefit of using drug combinations is the potential reduction in toxicity associated with high doses of single agents. By administering lower doses of each drug in combination, the overall toxicity can
be minimized, which in turn can improve the overall safety profile of the
treatment 170,171 .
2.4.3 Quantifying drug combination effects
The sheer number of potential drug combinations far surpasses what can
be evaluated in clinical trials. Consequently, researchers often employ
high-throughput screening techniques to systematically assess the effects
of various drug combinations in pre-clinical models, such as established
cell lines and those derived from patient samples. The primary goal of high-throughput screening is to identify drug combinations that demonstrate
a synergistic effect, in which the combined effect exceeds what would be
expected from the individual effects. Synergistic combinations uncovered
through high-throughput screening can then be prioritized for further
pre-clinical and clinical evaluation.
Cancer cell lines are essential models for studying the effects of drug
combinations. Derived from various tumor types, they capture a wide
range of genetic and phenotypic diversity, allowing researchers to assess
how different drug combinations perform across multiple cancer types and
genetic backgrounds. While these cultured cells may not fully replicate
the complexity of tumors in patients, they offer a valuable experimental
platform for investigating cancer biology and evaluating the therapeutic
efficacy of anticancer compounds in high-throughput experiments 172–174 .
In the high-throughput screening of drug combinations, researchers
frequently employ large-scale dose-response matrix experiments to analyze
drug combination effects. In these experiments, cell lines are exposed to
various concentration pairs of two drugs, along with each drug individually,
to measure their responses. The resulting dose-response matrix displays
the effects for every concentration pair, which can be directly compared to
the monotherapy effects along the matrix edges (Figure 2.3). The responses
are typically quantified using metrics such as the percentage of cell growth
inhibition, survival, or death relative to a control.
Figure 2.3. An illustrative example of a 7 × 7 dose-response matrix design used
to quantify the effects of drug combinations. In this matrix, the rows
and columns represent varying concentrations of two drugs, $D_1$ and
$D_2$, respectively. Each entry within the matrix indicates the observed
combination effect, such as measured by percentage growth of the
cell line. The monotherapy effects of each drug are represented along
the edges of the matrix. By comparing the observed combination
effects to the expected effects calculated using a reference model
of non-interaction, the drug combinations can be categorized as a)
additive, b) synergistic, or c) antagonistic based on deviations from
these expected values. Reprinted and modified from the author's
Master's thesis.
Analyzing data from dose-response matrices allows researchers to categorize the effects of drug combinations as synergistic, antagonistic, or
additive. This classification is based on how the observed response to the
combination deviates from the expected response, which is calculated using
a reference model of non-interaction. Synergism occurs when the observed
effect of the drug combination exceeds the expected effect derived from responses to the individual drugs, while antagonism refers to a combination
effect that is less than expected. Commonly used models for quantifying
drug combination synergy include the highest single agent (HSA) model 175 ,
the Bliss independence model 176 , and the Loewe additivity model 177 , each
differing in their underlying assumptions.
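As a minimal sketch of how two of these reference models can be applied to a dose-response matrix, the Python snippet below computes the expected effects under the HSA and Bliss independence models and the excess of the observed effect over each expectation. It assumes that effects are expressed as fractional inhibition between 0 and 1 and that the monotherapy responses sit in the first row and first column of the matrix. The function name `synergy_excess`, the matrix layout, and the toy numbers are illustrative assumptions, not the analysis pipeline used in this dissertation.

```python
import numpy as np

def synergy_excess(matrix):
    """Excess over HSA and Bliss expectations for a dose-response matrix.

    matrix[i, j] is the fractional inhibition (0..1) at dose i of drug 1 and
    dose j of drug 2. Row 0 and column 0 hold the monotherapy responses
    (the other drug at zero concentration). This layout is an assumption.
    """
    m = np.asarray(matrix, dtype=float)
    e1 = m[1:, [0]]        # drug 1 alone, shape (n1, 1)
    e2 = m[[0], 1:]        # drug 2 alone, shape (1, n2)
    observed = m[1:, 1:]   # responses to the concentration pairs

    # HSA: the expected effect is the larger of the two monotherapy effects.
    hsa_expected = np.maximum(e1, e2)
    # Bliss: the drugs are assumed to act as independent probabilistic events.
    bliss_expected = e1 + e2 - e1 * e2

    return observed - hsa_expected, observed - bliss_expected

# Hypothetical 3 x 3 matrix (zero dose plus two doses per drug).
toy = np.array([
    [0.0, 0.2, 0.40],
    [0.1, 0.5, 0.60],
    [0.3, 0.4, 0.58],
])
hsa_excess, bliss_excess = synergy_excess(toy)
```

With inhibition as the effect measure, positive excess values point towards synergy and negative values towards antagonism. In this toy matrix the same concentration pair can look synergistic under HSA yet additive or antagonistic under Bliss, which reflects the differing assumptions of the models.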
2.4.4 Prior research in predicting drug combination effects
Despite advancements in high-throughput screening technologies, systematically testing every possible drug combination is impractical, as the
number of potential drug and dose combinations increases exponentially
with the number of considered drug components and their respective doses.
This presents a particular challenge in cancer treatment, where the molecular heterogeneity of cancer necessitates highly individualized therapeutic
strategies. Over the past decade, a diverse array of computational approaches has been developed to identify promising drug combinations
for subsequent experimental validation 58,59,178 . This section provides an
overview and examples of computational methods used for predicting drug
combination effects.
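To give a rough sense of the scale involved, the short calculation below counts the measurements needed to screen every combination exhaustively. The library size, the number of doses, and the number of cell lines are hypothetical values chosen only for illustration, and the helper `n_experiments` is not taken from any cited study.

```python
from math import comb

def n_experiments(n_drugs, order, n_doses, n_cell_lines=1):
    """Measurements needed to test every `order`-drug combination.

    Each drug in a combination is tested at `n_doses` concentrations,
    so one combination requires n_doses ** order measurements per cell line.
    """
    return comb(n_drugs, order) * n_doses ** order * n_cell_lines

# Hypothetical screen: 100 drugs, 7 doses per drug, 60 cell lines.
pairs = n_experiments(100, 2, 7, n_cell_lines=60)    # 14,553,000
triples = n_experiments(100, 3, 7, n_cell_lines=60)  # 3,327,786,000
```

Moving from pairs to triples multiplies the workload by more than two hundred in this example, which is why computational prioritization is needed before experiments.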
Before the widespread adoption of machine learning, other computational methods were commonly used to predict drug combination effects,
and they remain useful, especially when training data is limited. These include, for instance, mathematical models, stochastic search
algorithms, and systems biology approaches 178 . Mathematical models
explicitly model drug combination response dynamics based on factors
like transcriptomic changes induced by the drugs 179,180 . Stochastic search
algorithms iteratively explore combinations to optimize synergy 181,182 .
Systems biology approaches model interactions within biological networks
to predict synergistic drug effects 183–185 . While these methods can offer
valuable insights, their effectiveness often depends on the accuracy of
underlying assumptions and the availability of detailed biological knowledge 178 .
More recently, machine learning techniques have gained popularity in
drug combination prediction, driven by the increasing availability of data
from high-throughput drug combination screens. These methods vary in
how they approach the prediction task and the types of input features and
training data used. Most models are based on supervised learning, where
the algorithm learns to predict outcomes from labeled training data. In
this context, the problem is typically framed as either a classification or
regression task. Classification models aim to categorize drug combinations
as additive, synergistic, or antagonistic, while regression models aim to predict continuous outcomes, such as synergy scores or dose-response values.
Input features in these models range from drug-specific characteristics like
chemical structures (e.g., molecular fingerprints or graph-based features)
and physicochemical properties, to biological data from the target system,
such as gene expression profiles or copy number variations in cancer cell
lines. Training data usually comes from large-scale experimental drug combination screens, with widely used datasets including NCI-ALMANAC 53 ,
DrugComb 186 , SYNERGxDB 187 , and the study by O’Neil et al. 188 .
Various machine learning algorithms have been employed for drug
combination prediction, including support vector machines 189 , random
forests 190,191 , extreme gradient boosting machines (XGBoost) 192,193 and
logistic regression 194 . Among these, popular ensemble tree-based models,
such as random forests and XGBoost, have demonstrated strong predictive performance and are frequently applied in this context. For example,
Sidorov et al. applied random forests and XGBoost to predict drug combination synergy. Their study developed separate models for each cell
line, training the models to predict synergy scores from a large publicly
available database based on various chemical features 192 . Random forests
were also used in a study by Li et al. 190 , which incorporated a variety of
features such as biomarkers and genetic data, including gene expression,
copy number variation and methylation data, to predict drug combination
synergy. Celebi et al. also used XGBoost to predict drug combination synergies using multi-omics features 193 . However, similar to neural networks,
such models can be computationally intensive to train and challenging to
interpret.
Deep learning, a field of machine learning that uses multi-layered neural
networks to capture complex patterns in data, has become increasingly
popular for predicting drug combinations in recent years 195–201 . For instance, a study by Preuer et al. was among the first to apply deep learning
to this task, utilizing both compound information and genomic data from
cancer cell lines to predict drug combination synergy 202 . However, despite
its potential, some studies have shown that deep learning does not consistently outperform other methods in predictive accuracy and can sometimes
struggle with generalization to unseen drugs 58,203–205 .
To objectively evaluate and advance computational methods for drug combination prediction, two crowdsourced DREAM (Dialogue for Reverse Engineering Assessment and Methods) challenges have been conducted 206,207 .
The latest, the AstraZeneca-Sanger Drug Combination DREAM Challenge,
evaluated 160 prediction methods on their ability to predict and classify
drug synergy using blinded data from 910 drug combinations across 85
cancer cell lines 206 . The top-performing method utilized a random forest
algorithm, integrating monotherapy data, drug target information, molecular features, and gene–gene interaction networks to predict synergy 190 .
The results from these challenges have shown that the selective incorporation of biological knowledge, rather than the choice of computational
method alone, is key to achieving high predictive performance 206,208 .
Despite the notable advancements in machine learning approaches for
drug combination prediction, at the time of Publication I, a key limitation that remained largely unaddressed was the dose-dependence of drug
combination effects. Most available methods at the time focused on predicting a single synergy score or categorizing combinations as synergistic,
additive, or antagonistic, without accounting for how these effects vary
with different doses. In practice, drug combinations can exhibit different
effects across dose levels; a combination that is synergistic at one dose may
be antagonistic at another. This highlighted the need for computational
models capable of providing detailed, dose-specific predictions.
2.5 Machine learning and statistical inference
Machine learning and statistical inference represent two related domains
within data science, both dedicated to deriving meaningful insights from
data. However, they differ in their primary objectives. Statistical inference focuses on drawing conclusions about a population from sample data
by estimating model parameters and testing hypotheses, while machine
learning centers on prediction and developing models that learn from data
to generalize to new, unseen instances.
I will use the following notation throughout this dissertation. Each
observation consists of a vector of predictor variables $\mathbf{x} \in \mathbb{R}^d$ (also known
as features or covariates) and an outcome variable $y$. The outcome can
be continuous, $y \in \mathbb{R}$ (regression), binary, $y \in \{0, 1\}$ (classification), or time-to-event, $y = \{T \in \mathbb{R}^+, \delta \in \{0, 1\}\}$ (survival analysis). The task is to estimate a
function $\hat{y}$ to make a good prediction of the output $y$, given the predictors
$\mathbf{x}$. To construct the prediction rules in a supervised manner, we use a
set of measurements $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, commonly referred to as the
training data, or simply as the sample in the context of statistical inference.
Depending on the specific goals of the analysis, the learned prediction rules
can be used for making inferences about the underlying data-generating
process or to make predictions on unseen instances.
Tensors are represented using bold calligraphic letters $\boldsymbol{\mathcal{X}}$, matrices using
bold uppercase letters $\mathbf{X}$, vectors using bold lowercase letters $\mathbf{x}$, and scalars
using regular lowercase letters $x$. The $i$th row of a matrix $\mathbf{X}$ is denoted as
$\mathbf{x}_i$. All vectors are considered as column vectors; thus, for a vector $\mathbf{a} \in \mathbb{R}^n$,
$\mathbf{a}^\top \mathbf{a}$ is a scalar and $\mathbf{a}\mathbf{a}^\top$ is an $n \times n$ matrix, where $\top$ denotes the transpose
operator. An inner product between two vectors $\mathbf{a}$ and $\mathbf{b}$ is expressed as
$\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\top \mathbf{b}$. I use the term interaction to refer to a statistical interaction
term between two predictor variables, as commonly used in biomedical and
epidemiological research. When referring to interactions involving three or
more variables, I specifically refer to these as higher-order interactions or
by their order, such as a third-order interaction for interactions involving
three variables.
2.5.1 Multiple regression
Multiple regression encompasses a set of statistical techniques used to
predict an outcome, or dependent variable $y$, from several independent or
predictor variables $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_d)$. Depending on the type of the outcome variable, commonly applied multiple regression techniques include
linear regression (for continuous outcomes), logistic regression (for binary
outcomes), and Cox proportional hazards regression (for time-to-event
outcomes).
These regression methods can be used for both statistical inference and prediction. When using these methods for statistical inference,
the focus is on understanding the magnitude and direction of associations
between the predictors and the outcome. In contrast, when using them
as machine learning tools for building prediction models, the focus is on
maximizing the accuracy of the model in predicting the outcome for new
instances. Often, both objectives can be pursued simultaneously, allowing
for the development of prediction models that also provide insights into
the factors underlying the prediction.
The most common multiple regression methods are based on a linear
model, which assumes that the outcome is related to the predictors through
a linear combination:
$$g(\hat{y}(\mathbf{x})) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d \tag{2.3}$$
where $g(\cdot)$ represents a link function that connects the linear predictor
to the outcome. The regression coefficients $\beta_i$ quantify the influence of
each predictor on the outcome and $\beta_0$ is an intercept term, also known as
the bias term. The choice of link function varies depending on the type of
regression model (Table 2.1).
Method                                Link function
Linear regression                     $\hat{y}(\mathbf{x}) = \boldsymbol{\beta}^\top \mathbf{x}$
Logistic regression                   $\log\left(\frac{P(y = 1 \mid \mathbf{x})}{1 - P(y = 1 \mid \mathbf{x})}\right) = \boldsymbol{\beta}^\top \mathbf{x}$
Cox proportional hazards regression   $\log(h(t \mid \mathbf{x})) = \log(h_0(t)) + \boldsymbol{\beta}^\top \mathbf{x}$
Table 2.1. Variations of multiple regression models.
To express the linear regression model in vector form, it is convenient
to include a constant variable 1 in the vector of predictor variables $\mathbf{x} = (1, x_1, x_2, \ldots, x_d)$ to account for the intercept term. The linear model can then
be compactly written as an inner product:
$$g(\hat{y}(\mathbf{x})) = \boldsymbol{\beta}^\top \mathbf{x} \tag{2.4}$$
where $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_d)^\top$ is a vector containing the regression coefficients. The term $\boldsymbol{\beta}^\top \mathbf{x}$ is commonly referred to as the linear predictor,
representing the linear combination of the predictor variables $\mathbf{x}$ weighted
by their respective coefficients $\boldsymbol{\beta}$. This vector notation will be used throughout the following sections.
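The equivalence between the expanded form in eq. (2.3) and the inner product in eq. (2.4) can be checked numerically with a small sketch. The coefficient and predictor values below are hypothetical and serve only to illustrate the notation.

```python
import numpy as np

# Hypothetical coefficients (beta_0, beta_1, beta_2) and raw predictors (x_1, x_2).
beta = np.array([0.5, 1.0, -2.0])
x_raw = np.array([3.0, 1.5])

# Prepend the constant 1 so that beta_0 acts as the intercept.
x = np.concatenate(([1.0], x_raw))

linear_predictor = beta @ x                                       # eq. (2.4)
expanded = beta[0] + beta[1] * x_raw[0] + beta[2] * x_raw[1]      # eq. (2.3)
```

Both expressions give the same value, so the intercept needs no special handling once the constant term is part of the predictor vector.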
In the field of statistical epidemiology and disease risk modeling, multiple
regression techniques are widely recognized as standard methods due to
their interpretability and ability to provide insights into the relationship
between predictors and outcomes. In Publications II-IV of this dissertation,
these models were employed to quantify associations between metabolomic
biomarkers and disease risk, as well as to derive risk prediction models.
Given the novelty of the UK Biobank metabolomics dataset, it was important to use these established regression techniques to provide a reliable
reference point for future studies and ensure that the results could be
meaningfully compared to existing research in the field.
Logistic regression
Logistic regression is a widely used form of multiple regression for modeling binary outcome variables $y \in \{0, 1\}$. In statistical epidemiology, it is
commonly used for modelling outcomes where time is not relevant, such as
prevalent disease outcomes.
Logistic regression uses the logistic function $\sigma : \mathbb{R} \to (0, 1)$, $\sigma(t) = \frac{1}{1 + e^{-t}}$, to
convert a linear combination of input predictors into a probability ranging
from 0 to 1:
$$\hat{y}(\mathbf{x}) = P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^\top \mathbf{x}}} \tag{2.5}$$
where $\mathbf{x}$ is the vector of predictor variables, including a constant term for
the intercept, and $\boldsymbol{\beta}$ is the vector of regression coefficients.
Logistic regression establishes a linear relationship between the predictors and the log odds of the outcome through the logit function, which is
the inverse of the logistic function:
$$\operatorname{logit}(\hat{y}(\mathbf{x})) = \log\left(\frac{\hat{y}(\mathbf{x})}{1 - \hat{y}(\mathbf{x})}\right) = \boldsymbol{\beta}^\top \mathbf{x} \tag{2.6}$$
Thus, logistic regression assumes that while the outcome itself is binary,
the log odds of the outcome (i.e., the logit) has a linear relationship with
the predictor variables.
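The relationship between eqs. (2.5) and (2.6) can be illustrated with a brief sketch: the logistic function maps the linear predictor to a probability, and the logit maps it back. The coefficient and predictor values are hypothetical.

```python
import numpy as np

def logistic(t):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def logit(p):
    """Log odds: the inverse of the logistic function."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients and one observation (leading 1 for the intercept).
beta = np.array([-1.0, 0.8])
x = np.array([1.0, 2.5])

eta = beta @ x      # linear predictor, here -1.0 + 0.8 * 2.5 = 1.0
p = logistic(eta)   # predicted probability, eq. (2.5)
```

Applying `logit` to `p` recovers `eta`, which is the sense in which the log odds are linear in the predictors.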
The parameters $\boldsymbol{\beta}$ in logistic regression are estimated using maximum
likelihood estimation (MLE), for which no closed-form solution exists. MLE
seeks to determine the parameter values that maximize the likelihood
of observing the given data under the logistic regression model. The
likelihood function $L(\boldsymbol{\beta})$ is given by:
$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \hat{y}_i(\mathbf{x})^{y_i} \left(1 - \hat{y}_i(\mathbf{x})\right)^{1 - y_i} \tag{2.7}$$
where $\hat{y}_i(\mathbf{x})$ is the predicted probability for individual or observation $i$ (eq.
(2.5)). The log-likelihood function, which provides an equivalent but more
convenient form for optimization, is obtained by taking the logarithm of
the likelihood function:
$$\ell(\boldsymbol{\beta}) = \log(L(\boldsymbol{\beta})) = \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i(\mathbf{x})) + (1 - y_i) \log(1 - \hat{y}_i(\mathbf{x})) \right] \tag{2.8}$$
The MLE estimates of $\boldsymbol{\beta}$ are obtained by maximizing this log-likelihood
function, typically using optimization methods such as Newton-Raphson
or gradient descent.
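The estimation procedure described above can be sketched in a few lines of NumPy. The snippet is a minimal illustration of Newton-Raphson applied to the log-likelihood in eq. (2.8), run on simulated data. The function names, the simulated coefficients, and the stopping rule are assumptions made for this example, and an established statistical package would normally be used in practice.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Log-likelihood of eq. (2.8) for design matrix X and binary outcomes y."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Maximize the log-likelihood with Newton-Raphson updates."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)                        # first derivative
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivative
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                              # Newton update
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data with hypothetical true coefficients (intercept first).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([-0.5, 1.0, -1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

beta_hat = fit_logistic_newton(X, y)
```

At the returned estimate the gradient of the log-likelihood is numerically zero, which is the condition that defines the maximum likelihood solution.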
Cox proportional hazards regression
Cox proportional hazards regression 32 , or simply Cox regression, is a
widely used method in survival analysis for modeling time-to-event outcomes, such as the onset of a disease. The outcome in Cox regression
consists of the time to event $T \in \mathbb{R}^+$ and an event indicator $\delta \in \{0, 1\}$, where
$\delta = 1$ indicates the event has occurred, and $\delta = 0$ indicates censoring, i.e.
the event has not occurred by the end of the observation period.
Unlike logistic regression, which models the probability of an event, Cox
regression models the hazard function $h(t \mid \mathbf{x})$, representing the instantaneous risk of the event at time $t$, conditional on survival up to that time.
The Cox model is expressed as:
$$h(t \mid \mathbf{x}) = h_0(t) \exp(\boldsymbol{\beta}^\top \mathbf{x}) \tag{2.9}$$
where $h_0(t)$ is the baseline hazard function, which varies over time but is
independent of the predictors, and $\exp(\boldsymbol{\beta}^\top \mathbf{x})$ is the partial hazard reflecting
the effect of the predictor variables on the baseline hazard. This formulation allows the model to separate the time-dependent baseline hazard from
the influence of the predictors. The term "proportional hazards" reflects the
assumption that the predictors affect the hazard rate in a multiplicative
manner, maintaining proportionality over time across individuals.
Taking the logarithm of the hazard function reveals the linear relationship of the predictor variables:
$$\log(h(t \mid \mathbf{x})) = \log(h_0(t)) + \boldsymbol{\beta}^\top \mathbf{x}. \tag{2.10}$$
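To make the proportional hazards assumption concrete, consider two individuals with predictor vectors $\mathbf{x}_a$ and $\mathbf{x}_b$ (symbols introduced here only for illustration). Dividing their hazards under eq. (2.9) gives:

```latex
\frac{h(t \mid \mathbf{x}_a)}{h(t \mid \mathbf{x}_b)}
  = \frac{h_0(t) \exp(\boldsymbol{\beta}^\top \mathbf{x}_a)}
         {h_0(t) \exp(\boldsymbol{\beta}^\top \mathbf{x}_b)}
  = \exp\!\left(\boldsymbol{\beta}^\top (\mathbf{x}_a - \mathbf{x}_b)\right)
```

The baseline hazard cancels, so the ratio of the two hazards does not depend on $t$: the hazards of any two individuals remain proportional throughout follow-up.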
The parameters $\boldsymbol{\beta}$ are estimated using partial likelihood, as the full likelihood is complicated by the baseline hazard function $h_0(t)$, which typically
remains unspecified. The partial likelihood focuses on the ordering of
events rather than their exact timing, allowing for the estimation of the
coefficients in a semi-parametric fashion without the need to specify the
baseline hazard $h_0(t)$:
$$L(\boldsymbol{\beta}) = \prod_{i : \delta_i = 1} \frac{h_0(T_i) \exp(\boldsymbol{\beta}^\top \mathbf{x}_i)}{\sum_{j \in R(T_i)} h_0(T_i) \exp(\boldsymbol{\beta}^\top \mathbf{x}_j)} = \prod_{i : \delta_i = 1} \frac{\exp(\boldsymbol{\beta}^\top \mathbf{x}_i)}{\sum_{j \in R(T_i)} \exp(\boldsymbol{\beta}^\top \mathbf{x}_j)} \tag{2.11}$$
where $\mathbf{x}_i$ is the vector of predictor variables for individual $i$, $T_i$ is their
observed event time, and $R(T_i)$ denotes the risk set at time $T_i$, consisting
of individuals who are still at risk just before the $i$th event.
To obtain a more convenient formulation, the log-partial likelihood function is often used:
$$l(\boldsymbol{\beta}) = \log(L(\boldsymbol{\beta})) = \sum_{i : \delta_i = 1} \left( \boldsymbol{\beta}^\top \mathbf{x}_i - \log\left( \sum_{j \in R(T_i)} \exp(\boldsymbol{\beta}^\top \mathbf{x}_j) \right) \right) \tag{2.12}$$
The maximum partial likelihood estimation approach involves finding
the parameter values $\boldsymbol{\beta}$ that maximize this log-partial likelihood function.
As no closed-form solution exists, the optimization is typically performed
using methods such as Newton-Raphson or gradient descent.
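The log-partial likelihood in eq. (2.12) can be evaluated directly, as the following minimal sketch shows. It assumes that there are no tied event times and defines the risk set as all individuals whose observed time is at least $T_i$. The function name and the three-person toy dataset are hypothetical.

```python
import numpy as np

def cox_log_partial_likelihood(beta, X, time, event):
    """Log-partial likelihood of eq. (2.12), assuming no tied event times."""
    eta = X @ beta                      # linear predictor for every individual
    total = 0.0
    for i in np.flatnonzero(event == 1):
        at_risk = time >= time[i]       # risk set R(T_i)
        total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return total

# Hypothetical data: one predictor, three individuals, one censored.
X = np.array([[0.0], [1.0], [2.0]])
time = np.array([5.0, 3.0, 1.0])
event = np.array([1, 0, 1])

ll_zero = cox_log_partial_likelihood(np.array([0.0]), X, time, event)
ll_one = cox_log_partial_likelihood(np.array([1.0]), X, time, event)
```

In this toy dataset the individual with the largest predictor value experiences the event first, so a positive coefficient yields a higher log-partial likelihood than a coefficient of zero. Only the ordering of the times enters the calculation, as noted above.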
Statistical inference and interpretation
In both logistic and Cox regression, the estimated coefficients $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_d)^\top$
can be used for statistical inference and interpretation:
• Logistic regression: The coefficients $\beta_i$ reflect the change in the log
odds of the outcome associated with a one-unit increase in the predictor,
assuming all other variables remain constant. When exponentiated, $e^{\beta_i}$
represents the odds ratio, indicating the multiplicative change in the
odds of the outcome for each one-unit increase in the predictor.
• Cox proportional hazards regression: The coefficients $\beta_i$ reflect the
change in the log hazard associated with a one-unit increase in the
predictor variable, assuming all other variables remain constant. When
exponentiated, $e^{\beta_i}$ represents the hazard ratio, indicating the multiplicative change in the hazard for each one-unit increase in the predictor.
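The odds ratio interpretation in the list above can be verified with a small numerical sketch: under a logistic model, the ratio of the odds at $x + 1$ to the odds at $x$ equals $e^{\beta_1}$ regardless of the starting value $x$. The coefficient values are hypothetical. The same argument applies to hazard ratios under eq. (2.9).

```python
import math

def odds(beta0, beta1, x):
    """Odds of the outcome under a one-predictor logistic model."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
    return p / (1.0 - p)

# Hypothetical intercept and coefficient.
beta0, beta1 = -2.0, 0.25

# One-unit increases from two different starting values.
odds_ratio_low = odds(beta0, beta1, 1.0) / odds(beta0, beta1, 0.0)
odds_ratio_high = odds(beta0, beta1, 4.0) / odds(beta0, beta1, 3.0)
```

Both ratios equal `math.exp(beta1)`, about 1.28, which corresponds to roughly 28% higher odds per one-unit increase in the predictor.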
To test the null hypothesis $H_0 : \beta_i = 0$ (indicating no effect of the predictor
$x_i$ on the outcome), the Wald test 209 is typically used in both logistic and
Cox regression. The Wald statistic $W$ is calculated as:
$$W = \left( \frac{\hat{\beta}_i}{\operatorname{se}(\hat{\beta}_i)} \right)^2 \tag{2.13}$$
where $\hat{\beta}_i$ is the estimated coefficient and $\operatorname{se}(\hat{\beta}_i)$ is its standard error.
Under the null hypothesis, $W$ follows a $\chi^2$ distribution with 1 degree of
freedom. For practical inference, the signed square root of $W$, $z = \hat{\beta}_i / \operatorname{se}(\hat{\beta}_i)$, is typically used
as a z-statistic, which is approximately standard normal under the null hypothesis. The $100(1 - \alpha)\%$
confidence interval for each estimated regression coefficient is given by:
$$\hat{\beta}_i \pm z_{\alpha/2} \cdot \operatorname{se}(\hat{\beta}_i) \tag{2.14}$$
where $z_{\alpha/2}$ is the critical value from the standard normal distribution.
This facilitates hypothesis testing for the significance of predictors and
interpreting their impact on the outcome.
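Equations 2.13 and 2.14 can be sketched with the Python standard library as follows. The estimate and standard error are hypothetical, and the two-sided p-value is computed from the normal approximation to the z-statistic.

```python
import math
from statistics import NormalDist

def wald_test(beta_hat, se, alpha=0.05):
    """Wald statistic (Eq. 2.13), two-sided p-value, and the
    100(1 - alpha)% confidence interval (Eq. 2.14)."""
    z = beta_hat / se
    w = z ** 2                                    # chi-squared with 1 df under H0
    p_value = math.erfc(abs(z) / math.sqrt(2))    # equals 2 * (1 - Phi(|z|))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z_{alpha/2}
    ci = (beta_hat - z_crit * se, beta_hat + z_crit * se)
    return w, p_value, ci

w, p, (lo, hi) = wald_test(beta_hat=0.40, se=0.10)   # hypothetical values
print(f"W = {w:.2f}, p = {p:.2e}, 95% CI = ({lo:.3f}, {hi:.3f})")
print(f"ratio-scale CI = ({math.exp(lo):.3f}, {math.exp(hi):.3f})")
```

Exponentiating the interval endpoints gives the corresponding interval for the odds ratio or hazard ratio.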
When interpreting the association results, potential sources of bias
such as confounding, information bias, and selection bias must be considered 210,211 . Confounding arises when the association between a predictor
and an outcome is distorted by a third variable, a confounder, which is
associated with both the predictor and the outcome. Confounding can often
be corrected for by including the confounder as a covariate in the model or
through stratification. Information bias results from measurement errors
in predictor or outcome variables, while selection bias can occur if the
study sample is not representative of the target population. Such biases
may lead to inaccurate estimates and incorrect conclusions about the true
effect sizes. However, when the goal is to discover predictive associations
rather than to make causal inferences, these biases are less critical. In this
context, the presence of confounders or other biases does not necessarily
undermine the model’s utility, provided it reliably predicts the outcome of
interest 212 .
Regularization
When using multiple regression models for prediction, regularization becomes a pivotal technique to address the risk of overfitting. Overfitting
arises when a model is excessively complex, capturing noise in the training
data rather than the underlying patterns, which can hinder its performance on new data. Regularization addresses this issue by adding a
penalty term to the objective function, constraining the magnitude of the regression coefficients $\beta$. This process results in simpler, more parsimonious
models that improve the generalizability to unseen data.
In the context of logistic and Cox proportional hazards regression, estimation of the regression coefficients $\beta$ involves maximizing the log-likelihood
or log-partial likelihood functions, respectively. This is equivalent to minimizing the negative log-(partial) likelihood $-l(\beta)$. Regularization modifies
this objective function by adding a penalty term $P(\beta)$, which imposes a
constraint on the magnitude of the coefficients:
$$\hat{\beta} = \arg\min_{\beta} \; -l(\beta) + \lambda P(\beta) \tag{2.15}$$
where $\lambda \geq 0$ is a regularization parameter that determines the strength
of the penalty. The penalty term can take different forms depending on the
type of regularization. For instance, ridge regression 213 involves the use
of the L2 regularizer $P(\beta) = \|\beta\|_2^2 = \sum_{i=1}^{d} \beta_i^2$, while LASSO 214 (least absolute
shrinkage and selection operator) regression incorporates the L1 regularizer
$P(\beta) = \|\beta\|_1 = \sum_{i=1}^{d} |\beta_i|$.
The regularization parameter $\lambda$ controls the trade-off between fitting the
model to the training data and imposing a penalty on the complexity of the
model. When $\lambda = 0$, the model reduces to the unregularized form, where
the penalty has no influence. As $\lambda$ increases, the penalty term exerts a
stronger influence, resulting in greater shrinkage of the coefficients, which
can reduce variance but potentially increase bias, commonly referred to as
the bias-variance trade-off.
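The different behaviour of the two penalties can be seen in a deliberately simplified setting. The sketch below assumes a single coefficient and replaces $-l(\beta)$ with the quadratic stand-in $\tfrac{1}{2}(b - a)^2$, where $a$ plays the role of the unpenalized estimate; this is an illustration only, not the logistic or Cox objective. Under that assumption both penalized minimizers have closed forms.

```python
def ridge_solution(a, lam):
    """Minimizer of 0.5 * (b - a)**2 + lam * b**2."""
    return a / (1.0 + 2.0 * lam)

def lasso_solution(a, lam):
    """Minimizer of 0.5 * (b - a)**2 + lam * abs(b) (soft-thresholding)."""
    magnitude = max(abs(a) - lam, 0.0)
    return magnitude if a >= 0 else -magnitude

unpenalized = [2.0, -0.3, 0.05]   # hypothetical unpenalized estimates
for lam in (0.0, 0.1, 0.5):
    ridge = [round(ridge_solution(a, lam), 3) for a in unpenalized]
    lasso = [round(lasso_solution(a, lam), 3) for a in unpenalized]
    print(f"lambda={lam}: ridge={ridge}, lasso={lasso}")
```

In this toy setting the L2 penalty shrinks every coefficient proportionally without setting any to zero, while the L1 penalty sets small coefficients exactly to zero, which is why LASSO also acts as a selection operator.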
Interactions
Interactions in statistical modelling occur when the effect of one predictor
on the outcome depends on the level of another predictor (Figure 2.4). This
implies that the predictors jointly influence the outcome in a way that is
not simply additive. Including interaction terms in a regression model
allows for the exploration of more complex relationships and can provide
insights into how different factors jointly affect the outcome. For instance,
the biological processes giving rise to disease are often inherently complex,
with interactions representing one form of this complexity.
Recognizing and appropriately modeling interactions can enhance both
the predictive accuracy and interpretability of a regression model. In
predictive modeling, accounting for interactions can lead to more accurate
predictions by capturing complex relationships between predictors that
would otherwise be missed if only main effects were considered. For
example, in many machine learning models, interactions between variables
are often implicitly captured through algorithms such as random forests
or neural networks. However, in linear models, interactions need to be
explicitly specified.
To model interactions in a regression model, an interaction term is
created by multiplying the two predictors of interest. The regression
equation then includes this interaction term in addition to the main effects
of the individual predictors:
$$g(\cdot) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \ldots \tag{2.16}$$
where $g(\cdot)$ is a link function appropriate for the model (e.g. logit for
logistic regression, identity for linear regression) and $\beta_0$ is the intercept.
$x_1$ and $x_2$ are the predictor variables of interest, with $\beta_1$ and $\beta_2$ giving their
respective main effects on the outcome. $x_1 x_2$ is the interaction term and
$\beta_{12}$ represents the effect of the interaction on the outcome. The model can
be extended to include additional predictors and their interactions.
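A short sketch of Eq. 2.16 with two predictors shows what the interaction coefficient does: the slope of $x_1$ becomes $\beta_1 + \beta_{12} x_2$, so it depends on the value of $x_2$. The coefficient values are hypothetical and chosen to mimic the three scenarios of Figure 2.4.

```python
def linear_predictor(x1, x2, b0, b1, b2, b12):
    """Right-hand side of Eq. 2.16 for two predictors."""
    return b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2

def slope_of_x1(x2, b1, b12):
    """Change in the linear predictor per unit of x1 at a given x2."""
    return b1 + b12 * x2

# Hypothetical interaction coefficients for the three scenarios
scenarios = {"none": 0.0, "positive": 0.5, "negative": -0.5}
for name, b12 in scenarios.items():
    slopes = [slope_of_x1(x2, b1=0.5, b12=b12) for x2 in (0, 1)]
    print(f"{name:>8} interaction: slope of x1 at x2=0,1 -> {slopes}")
```

With no interaction the two slopes coincide (parallel lines); a positive or negative $\beta_{12}$ makes the lines diverge or converge.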
Figure 2.4. Illustrative examples of different data scenarios depicting interaction
effects in the context of a continuous outcome variable $y$ (y-axis). a) No interaction:
the continuous predictor $x_1$ (x-axis) and binary predictor $x_2$ (color-coded) both exhibit positive main effects on $y$, but no interaction effect
is present. b) Positive interaction: similar to a), but with an additional positive interaction
effect between $x_1$ and $x_2$ on $y$. c) Negative interaction: similar to a), but with an additional
negative interaction effect between $x_1$ and $x_2$ on $y$.
Modeling interactions becomes increasingly complicated as the number
of predictors increases, posing a particular challenge in prediction models
involving many variables. As the number of possible interaction terms
increases quadratically with the number of predictors, including all potential interactions quickly becomes both computationally and statistically
challenging. This challenge motivated the work in Publication V, where we
developed an extension to the Cox proportional hazards regression model
to effectively model interactions among predictor variables. This approach
was developed based on concepts derived from factorization machines, a
machine learning technique presented in the following section.
2.5.2 Factorization machines
Factorization machines (FMs) provide a non-linear extension to standard
multiple regression techniques that can capture comprehensive interactions among predictor variables. They were introduced by Steffen Rendle
in 2010 215 as a method to generalize matrix factorization and regression
models by efficiently modeling interactions in high-dimensional datasets.
Initially, FMs were applied to recommender systems and click-through
rate prediction, effectively modeling sparse and high-dimensional data
common in these domains 215–219 .
Standard formulation of factorization machines
The central idea behind FMs is to model the interactions between pairs of
variables using inner products of low-dimensional latent vectors (Figure
2.5a). In this framework, each predictor variable is associated with a
corresponding latent vector. The interaction between any two variables is
then represented by the inner product of their latent vectors. This approach
enables FMs to capture complex interaction patterns without explicitly
enumerating all possible interactions, which would be computationally
and statistically challenging in high-dimensional settings. Mathematically,
the FM model can be expressed as:
$$\hat{y}(x) = \beta^\top x + \sum_{1 \leq i \neq j \leq d} \langle p_i, p_j \rangle \, x_i x_j \tag{2.17}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product. The first term represents the linear
effects of the predictor variables, analogous to standard multiple regression
models, where $\beta$ contains the regression coefficients for the predictors $x$.
The second term accounts for all potential pairwise interactions between
the predictor variables $x_i$ and $x_j$. Instead of estimating these interaction
effects $\beta_{ij}$ directly, FMs approximate them using the inner product of two
latent vectors $p_i$ and $p_j$:
$$\beta_{ij} \approx \langle p_i, p_j \rangle = \sum_{f=1}^{k} p_{if} \cdot p_{jf} \tag{2.18}$$
where $p_i \in \mathbb{R}^k$ represents the contribution of variable $x_i$ to $k$ latent factors.
The collection of these latent vectors forms the parameter matrix $P \in \mathbb{R}^{d \times k}$.
The rank $k$ is a hyperparameter that defines the dimensionality of the
latent factor space, typically chosen such that $k \ll d$, thereby reducing the
number of interaction parameters to estimate from $d^2$ to $dk$ (Figure 2.5a).
This methodology has two key advantages. First, estimation of the interaction parameters, which would be quadratic in the number of variables
if modeled explicitly, is effectively reduced in complexity due to the use of
low-dimensional latent vectors. This enables the estimation of interaction
effects even in scenarios where explicit modeling would be infeasible due to
the quadratic increase of potential interactions. Second, the methodology
allows for reliable parameter estimation even in cases of sparse data. This
means that co-occurrence of predictors $x_i$ and $x_j$ does not need to be directly
observed to learn the interaction effect $\beta_{ij}$; the latent vectors $p_i$ and $p_j$
can be learned through interactions with other predictors and the inner
product still gives $\beta_{ij}$. The latter feature proves particularly valuable in
predicting drug combination responses in Publication I, where it facilitates inferences about the effects of new drug combinations based on prior
observations of their individual components elsewhere in the training data.
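The first advantage can be made concrete with a small sketch of the FM prediction in Eq. 2.17. One function evaluates the pairwise sum literally over all ordered pairs $i \neq j$. The other uses the algebraic identity $\sum_{i \neq j} \langle p_i, p_j \rangle x_i x_j = \|\sum_i x_i p_i\|^2 - \sum_i \|p_i\|^2 x_i^2$, which costs $O(dk)$ instead of $O(d^2 k)$. Function names, sizes and random inputs are illustrative assumptions. If each pair is counted once ($i < j$), as in Rendle's original formulation, the pairwise term is half of what is computed here.

```python
import random

def fm_predict_naive(x, beta, P):
    """Eq. 2.17 evaluated literally: sum over all ordered pairs i != j."""
    d = len(x)
    linear = sum(b * xi for b, xi in zip(beta, x))
    pairwise = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                inner = sum(pf * qf for pf, qf in zip(P[i], P[j]))
                pairwise += inner * x[i] * x[j]
    return linear + pairwise

def fm_predict_fast(x, beta, P):
    """Same value in O(d * k): ||sum_i x_i p_i||^2 - sum_i ||p_i||^2 x_i^2."""
    k = len(P[0])
    linear = sum(b * xi for b, xi in zip(beta, x))
    pairwise = 0.0
    for f in range(k):
        s = sum(P[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((P[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += s * s - s_sq
    return linear + pairwise

random.seed(0)
d, k = 6, 2                       # hypothetical sizes
x = [random.gauss(0, 1) for _ in range(d)]
beta = [random.gauss(0, 1) for _ in range(d)]
P = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
print(fm_predict_naive(x, beta, P), fm_predict_fast(x, beta, P))
```

Both functions return the same value up to floating-point rounding, while only $dk$ interaction parameters are stored in `P`.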
FMs are applicable to a wide range of prediction tasks, including both
regression and classification. The choice of the loss function, which quantifies the error between predicted and actual values, depends on the specific
task at hand. Given a loss function $\mathcal{L}$ and L2 (ridge) regularization, the
regularized objective function for FMs can be expressed as:
$$\arg\min_{\beta, P} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, \hat{y}(x_i)) + \lambda_1 \|\beta\|_2^2 + \lambda_2 \|P\|_2^2 \tag{2.19}$$
where $\lambda_1 > 0$ and $\lambda_2 > 0$ are regularization parameters for the linear
effects $\beta$ and the matrix of factor vectors $P$ giving the effects for the interaction terms, respectively. The model parameters are typically estimated
using gradient-based methods, such as stochastic gradient descent (SGD).
Factorization machines were initially developed for regression and classification tasks, using loss functions like a squared loss for regression or logit
or hinge loss for classification. In Publication V, we extend factorization
machines to the context of survival analysis and show their ability to improve disease risk prediction compared to the standard linear Cox proportional
hazards regression model.
Higher-order factorization machines
Higher-order factorization machines (HOFMs) extend the standard factorization machines (FMs) by capturing interactions beyond pairwise relationships 62,215 . While standard FMs are limited to modeling individual
effects and pairwise interactions among predictor variables, HOFMs are
designed to also account for higher-order interactions involving three or
more predictors.
The core idea of HOFMs is to decompose interactions among three or
more variables into products of their respective latent vectors, similar to
the approach used in standard FMs for pairwise interactions (Figure 2.5b).
Mathematically, the model for an HOFM of order $m$ can be expressed as:
$$\hat{y}(x) = \beta^\top x + \sum_{1 \leq i \neq j \leq d} \langle p_i^{(2)}, p_j^{(2)} \rangle \, x_i x_j + \ldots + \sum_{1 \leq i_1 < \ldots < i_m \leq d} \langle p_{i_1}^{(m)}, p_{i_2}^{(m)}, \ldots, p_{i_m}^{(m)} \rangle \, x_{i_1} x_{i_2} \cdots x_{i_m} \tag{2.20}$$
where the first term corresponds to the linear effects of the predictor
variables, analogous to standard multiple regression models, where $\beta$
contains the regression coefficients for the predictors $x$. The second and
higher-order interaction effects are estimated in a factorized form:
$$\beta_{ij} \approx \langle p_i, p_j \rangle \tag{2.21}$$
$$\beta_{i_1, i_2, \ldots, i_t} \approx \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, \ldots, p_{i_t}^{(t)} \rangle \tag{2.22}$$
where $p_i^{(t)} \in \mathbb{R}^{k_t}$ denotes the $t$th order factor weight of feature $i$ and
$k_t \in \mathbb{N}^+$ is the hyperparameter defining the rank of the factorization. The
generalized inner product $\langle \cdot, \ldots, \cdot \rangle$ over $t$ vectors $u_i \in \mathbb{R}^k$, $i = 1, \ldots, t$ is defined
as:
$$\langle u_1, u_2, \ldots, u_t \rangle = \sum_{f=1}^{k} u_{1f} u_{2f} \cdots u_{tf} \tag{2.23}$$
which generalizes the usual pairwise inner product $\langle u, v \rangle = u^\top v$ to sets of $t$
vectors.
The latent vectors for each interaction order $t$ are collected into matrices
$P^{(t)} \in \mathbb{R}^{d \times k_t}$. Hence, the number of parameters to estimate for each degree $t$
reduces from $d^t$ to $dk_t$ (Figure 2.5b). While the rank $k_t \in \mathbb{N}^+$ can be set
separately for each order $t$, it is often convenient in practice to use a uniform rank
$k$ across all orders, $k_1 = k_2 = \ldots = k_m$.
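The generalized inner product of Eq. 2.23 and a single order-$t$ term of Eq. 2.20 can be sketched as below. The interaction term is computed by explicit enumeration of all index sets $i_1 < \ldots < i_t$, which is only practical for small $d$ and $t$. The example values are hypothetical.

```python
from itertools import combinations

def generalized_inner(*vectors):
    """Eq. 2.23: sum over latent factors of the elementwise product."""
    total = 0.0
    for components in zip(*vectors):      # one latent factor f at a time
        prod = 1.0
        for c in components:
            prod *= c
        total += prod
    return total

def interaction_term(x, P_t, t):
    """Order-t term of Eq. 2.20, enumerating all i_1 < ... < i_t explicitly."""
    total = 0.0
    for idx in combinations(range(len(x)), t):
        weight = generalized_inner(*(P_t[i] for i in idx))
        x_prod = 1.0
        for i in idx:
            x_prod *= x[i]
        total += weight * x_prod
    return total

# Hypothetical example: d = 4 predictors, rank k_3 = 2
x = [1.0, 2.0, 0.0, 1.0]
P3 = [[1.0, 0.0], [0.5, 1.0], [2.0, 2.0], [1.0, -1.0]]
print(generalized_inner([1, 2], [3, 4]))   # reduces to the ordinary dot product
print(interaction_term(x, P3, t=3))

d, k = 100, 10
print("explicit third-order weights:", d ** 3, "factorized:", d * k)
```

The last line illustrates the reduction from $d^t$ to $dk_t$ parameters for $t = 3$.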
Similar to standard FMs, HOFMs are applicable to various prediction
tasks, including regression and classification, depending on the choice of
the loss function $\mathcal{L}$. The L2-regularized objective function for HOFMs can
be expressed as:
$$\arg\min_{\beta, P^{(1)}, P^{(2)}, \ldots, P^{(m)}} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(y_i, \hat{y}(x_i)) + \lambda_1 \|\beta\|_2^2 + \sum_{t=2}^{m} \lambda_t \|P^{(t)}\|_2^2 \tag{2.24}$$
where $\lambda_1, \lambda_2, \ldots, \lambda_m > 0$ are regularization parameters.
Despite the conceptual higher-order extension presented in the original
work by Rendle 215 , efficient estimation of HOFMs was only made feasible
with the dynamic programming approach proposed by Blondel et al. 62 .
This approach allowed for the efficient calculation of interaction terms and
their gradients, thus enabling the use of stochastic gradient algorithms for
training HOFMs, notably improving their computational feasibility.
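The idea behind such a dynamic programming scheme can be sketched with the standard recursion for the ANOVA kernel. For one latent factor $f$, set $z_i = p_{if} x_i$; the order-$m$ term is then the sum of all products of $m$ distinct $z_i$, which can be accumulated in $O(dm)$ operations instead of enumerating all index sets. The code below is an illustrative sketch of this recursion, checked against brute-force enumeration. It is not claimed to reproduce the exact algorithm of ref. 62, and all names and values are hypothetical.

```python
from itertools import combinations

def anova_kernel(z, m):
    """Sum over i_1 < ... < i_m of z[i_1] * ... * z[i_m], in O(len(z) * m)."""
    a = [1.0] + [0.0] * m          # a[t]: order-t sum over the items seen so far
    for z_j in z:
        for t in range(m, 0, -1):  # descending so each item is used at most once
            a[t] += z_j * a[t - 1]
    return a[m]

def hofm_term_dp(x, P_m, m):
    """Order-m term of Eq. 2.20 using the recursion, factor by factor."""
    k = len(P_m[0])
    return sum(anova_kernel([P_m[i][f] * x[i] for i in range(len(x))], m)
               for f in range(k))

def hofm_term_brute(x, P_m, m):
    """Same quantity by explicit enumeration of all index sets."""
    total = 0.0
    for idx in combinations(range(len(x)), m):
        for f in range(len(P_m[0])):
            prod = 1.0
            for i in idx:
                prod *= P_m[i][f] * x[i]
            total += prod
    return total

x = [1.0, 2.0, -1.0, 0.5, 3.0]                      # hypothetical input
P = [[0.1, 0.2], [0.3, -0.1], [0.5, 0.4], [-0.2, 0.6], [0.7, 0.1]]
print(hofm_term_dp(x, P, 3), hofm_term_brute(x, P, 3))
```

Because the recursion consists only of sums and products, its gradients with respect to the latent factors can be obtained in the same way, which is what makes stochastic gradient training feasible.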
Factorization machines and their higher-order extensions have demonstrated state-of-the-art performance in fields such as recommender systems
and click-through rate prediction for digital advertising 216–219 . However,
their application in other domains, particularly in biomedical research,
has been limited. In Publication I, we apply higher-order factorization
machines to model complex interactions in inherently high-dimensional
drug combination dose-response data and show their ability to perform
accurate predictions in various prediction scenarios.
Figure 2.5. a) Standard factorization machines estimate all pairwise interaction effects among predictors using the factorized parametrization $\beta_{ij} = \langle p_i, p_j \rangle$.
b) Higher-order factorization machines extend the factorization to interactions of order $t$, $\beta_{i_1, i_2, \ldots, i_t} \approx \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, \ldots, p_{i_t}^{(t)} \rangle$ (here, illustrated for
$t = 3$). $d$ denotes the number of predictors and $k$ is a hyperparameter
defining the rank of the factorization. The rank of the factorization is
typically much lower than the number of predictor variables ($k \ll d$),
enabling estimation of the interaction effects even in the presence of
many predictors.
3. Predictive modelling of drug
combination effects (Publication I)
Given the vast number of possible drug and dose combinations, computational methods are essential for guiding experimental efforts toward the
most promising candidates for further pre-clinical and clinical validation.
Although high-throughput screening has produced extensive datasets of
drug combination responses, notable gaps persist in the combinatorial
drug space. In Publication I, we introduced comboFM, a novel machine
learning framework that models the dose-specific effects of drug combinations. Unlike previously proposed machine learning approaches that
primarily focused on synergy, comboFM provides detailed predictions of
drug responses across a range of doses.
3.1 Foundations of comboFM
comboFM builds upon representing the dose-specific drug combination
response data as a higher-order tensor $\mathcal{X}$, indexed by drugs, their concentrations, and cell lines (Figure 3.1). Additionally, comboFM allows for the
integration of auxiliary data such as chemical features of the drugs, or
genomic or transcriptomic features of the cell lines.
Figure 3.1. Dose–response matrices from experimental measurements are arranged into a fifth-order tensor $\mathcal{X}$, indexed by drugs, concentrations,
and cell lines, along with additional descriptors (e.g. chemical fingerprints of the drugs and gene expression of the cell lines). Reprinted from Publication I.
comboFM employs higher-order factorization machines (HOFMs) 62 to
model interactions within the tensor. The data tensor is flattened into a
matrix $X$, with each row vector $x$ representing an entry from the tensor.
The method then estimates regression weights $\beta$ for the individual predictors, along with weights for each higher-order feature interaction using
the factorized parameterization $\beta_{i_1, i_2, \ldots, i_t} \approx \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, \ldots, p_{i_t}^{(t)} \rangle$ (Section 2.5.2). This
approach avoids the computational and statistical difficulties related to
direct estimation of the weight tensor $B$ for the higher-order interactions
(Figure 2.5b). Moreover, by coupling weights through latent factors, the
approach enables effective learning even in scenarios of sparsely populated
data tensors.
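The flattening step can be illustrated with a minimal sketch. It assumes that the five tensor modes are drug 1, drug 2, their two concentrations, and the cell line, and that each mode is one-hot encoded and concatenated into a single row vector $x$. The sizes, function names and encoding details are hypothetical, and the encoding used in Publication I may differ. Auxiliary descriptors (e.g. fingerprints or gene expression) would be appended as further columns.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def encode_entry(drug1, drug2, conc1, conc2, cell, sizes):
    """Concatenate one-hot encodings of the five tensor modes into one row x."""
    n_drugs, n_concs, n_cells = sizes
    return (one_hot(drug1, n_drugs) + one_hot(drug2, n_drugs)
            + one_hot(conc1, n_concs) + one_hot(conc2, n_concs)
            + one_hot(cell, n_cells))

# Hypothetical sizes: 3 drugs, 4 concentration levels, 2 cell lines
sizes = (3, 4, 2)
x = encode_entry(drug1=0, drug2=2, conc1=1, conc2=3, cell=1, sizes=sizes)
print(x)
print(len(x), sum(x))
```

With this encoding, exactly five entries of each row are non-zero, so a fifth-order interaction term picks out one specific combination of drug pair, concentration pair and cell line.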
This approach of first predicting the entire dose–response matrices allows one to utilize the detailed data they contain. Subsequently, drug
combination synergy can be quantified over the full predicted matrix using
various models, providing a more comprehensive understanding of potentially synergistic effects. Furthermore, understanding the effects of drug
combinations at both the dose-response and synergy levels offers valuable
guidance for precision medicine efforts, as lower doses tend to be favored
in clinical settings due to better tolerability.
3.2 Drug combination dataset
comboFM was trained and tested using data from a cancer cell line pharmacogenomic screen from the NCI-ALMANAC study, the largest available
drug combination dataset at the time. This dataset comprises over 5000
combinations of around 100 FDA-approved oncology drugs tested against
60 cancer cell lines in several concentration pairs. To manage computational complexity, we considered a subset of this data containing 50
randomly chosen drugs. These drugs had been screened in 617 unique
combinations across all the 60 cell lines derived from 9 tissue types. This
subset included 333,180 measurements of drug combination responses and
222,120 monotherapy response measurements, recorded as the percentage
growth of the cell lines. As additional features for the model training, we
also incorporated information related to the molecular fingerprints of the
drug compounds and gene expression profiles of the cancer cell lines.
3.3 Evaluation settings
comboFM was tested in three practical prediction scenarios. The first scenario involved predicting missing entries in partially observed dose–response matrices. The second scenario focused on predicting entirely unmeasured dose–response matrices for known drug combinations in new cell
lines. The third and most difficult scenario involved predicting responses
for completely new drug combinations. To evaluate the performance of
comboFM in predicting drug combination responses and to tune model
hyperparameters, we implemented a 10×5 nested cross-validation procedure across all three scenarios. The interaction order was set to m = 5,
corresponding to the dimensionality of the underlying tensor.
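The structure of a 10×5 nested cross-validation can be sketched as follows: the inner 5-fold loop selects the hyperparameter using the outer-training data only, and the outer 10-fold loop estimates performance on data never used for tuning. For brevity the sketch uses a one-predictor ridge regression on synthetic data as a stand-in model; it is not the comboFM training code. For the second and third scenarios, the folds would need to be formed over whole dose–response matrices or drug combinations rather than individual entries, which is not shown here.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and split into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_ridge_1d(xs, ys, lam):
    """Minimizer of sum (y - b*x)^2 + lam * b^2 (one predictor, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(b, xs, ys):
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

def nested_cv(xs, ys, lambdas, outer_k=10, inner_k=5):
    outer_scores = []
    for test_idx in k_fold_indices(len(xs), outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]

        def inner_error(lam):
            # Inner loop: uses the outer-training data only.
            errs = []
            for val_pos in k_fold_indices(len(train_idx), inner_k, seed=1):
                val = [train_idx[p] for p in val_pos]
                val_set = set(val)
                fit = [i for i in train_idx if i not in val_set]
                b = fit_ridge_1d([xs[i] for i in fit], [ys[i] for i in fit], lam)
                errs.append(mse(b, [xs[i] for i in val], [ys[i] for i in val]))
            return sum(errs) / len(errs)

        best_lam = min(lambdas, key=inner_error)
        # Refit on all outer-training data, then score on the held-out fold.
        b = fit_ridge_1d([xs[i] for i in train_idx],
                         [ys[i] for i in train_idx], best_lam)
        outer_scores.append(mse(b, [xs[i] for i in test_idx],
                                [ys[i] for i in test_idx]))
    return outer_scores

rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]        # synthetic data
scores = nested_cv(xs, ys, lambdas=[0.0, 1.0, 10.0])
print(len(scores), sum(scores) / len(scores))
```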
To assess the potential advantages of incorporating higher-order interactions, we also conducted analyses with second-order FMs and first-order
FMs (equivalent to ridge regression). Additionally, to further evaluate the
predictive performance of comboFM, we applied random forest (RF) as
another reference model. RF is a commonly used machine learning model
that operates on a different learning principle. It has previously been successfully applied to predict drug combination effects, including as the winning
method in the recent AstraZeneca-Sanger drug combination prediction
DREAM Challenge 190,206 .
Figure 3.2. Three prediction scenarios are considered: a) predicting missing
entries in partially tested dose–response matrices, b) predicting a
complete dose–response matrix in a new cell line, and c) making
predictions for a completely new drug combination not tested so far in
any cell line. Reprinted from Publication I.
3.4 Accurate predictions of drug combination effects
The 5th-order comboFM achieved the highest predictive accuracy among
the compared methods in all prediction scenarios, with Pearson correlations between predicted and observed dose-responses ranging from 0.95 to
0.97 (Figure 3.3). The predictive performance remained consistent across
different tissue types and drug classes, demonstrating its robustness across
biological contexts (Figure 3.4). Importantly, comboFM generalized well to new drug combinations, enabling systematic
prediction of dose–response matrices for previously untested combinations. This could provide practical guidance on repositioning the drugs into
new combinations. Furthermore, comboFM accurately recovered synergy
scores from predicted dose–response matrices in all evaluated scenarios,
outperforming the compared methods.
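The three performance metrics reported for the dose–response predictions (RMSE, Pearson correlation and Spearman correlation) can be computed as in the sketch below. The measured and predicted values are hypothetical and serve only to show the calculation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson(a, b):
    """Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def average_ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

measured = [100.0, 80.0, 45.0, 10.0, -20.0]    # hypothetical %-growth values
predicted = [95.0, 85.0, 40.0, 20.0, -35.0]
print(rmse(measured, predicted), pearson(measured, predicted),
      spearman(measured, predicted))
```

RMSE is expressed in the units of the response (%-growth), whereas the two correlations are unitless; Spearman correlation depends only on the ordering of the values.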
3.5 Experimental validation of predicted drug combinations
In-house experimental laboratory validation confirmed the robustness of
comboFM’s predictions, even under different assay conditions from the
original NCI-ALMANAC dataset. For instance, the experimental validation verified a novel synergy between proteasome inhibitor bortezomib
and anaplastic lymphoma kinase inhibitor crizotinib in lymphoma cells,
as predicted by comboFM. Additionally, comboFM accurately predicted
the efficacy of combinations involving the histone deacetylase (HDAC) inhibitor romidepsin in targeting BRAF-mutant melanoma cell lines, which
was also confirmed experimentally. While many of the drugs in these
romidepsin-based combinations have previously been explored in other
combinations for melanoma treatment 220,221 , the specific combinations
predicted by comboFM remain untested in melanoma and require further
investigation. Each of these predicted inhibitors has shown promise individually against melanoma in pre-clinical or clinical studies, supporting
their potential use in combination therapies. This validation highlights
the generalizability and practical applicability of comboFM.
3.6 Conclusions
In conclusion, due to the high costs associated with experimental drug combination screening, comboFM presents a time- and cost-efficient solution
for prioritizing promising combinations for further pre-clinical and clinical investigation. Its robust and accurate predictions provide a valuable
tool to accelerate the development and extension of combination therapies
in precision oncology. This holds potential to advance the clinical use
of drug combination therapies to address drug resistance and increase
treatment efficacy. Several of the combinations predicted by comboFM are
already undergoing clinical trials, highlighting the method’s translational
potential.
Figure 3.3. Predictive performance of the 5th-order comboFM (comboFM-5; blue) and random forest (RF; orange), comparing the measured and predicted
dose–response values (%-growth) across the three prediction scenarios: a) predicting new entries in dose–response matrices, b) predicting entire
dose–response matrices, and c) predicting new drug combinations.
Performance metrics are reported as root mean squared error (RMSE),
Pearson correlation and Spearman correlation for the drug combination response predictions, along with Pearson correlations of the
synergy scores (NCI ComboScores) computed from the predicted
dose–response matrices. Reprinted and modified from Publication I.
Figure 3.4. Predictive performance of the 5th-order comboFM-5, 2nd-order comboFM-2 and 1st-order comboFM-1, along with random forest (RF),
evaluated across tissue types (a-c) and drug classes (d-f) under three
prediction scenarios: a, d) predicting new entries in dose–response
matrices; b, e) predicting entire dose–response matrices, and c, f)
predicting new drug combinations. In the box plots, horizontal lines
indicate the median Pearson correlation, while the lower and upper
hinges correspond to the 25th and 75th percentiles. The whiskers
show the highest and lowest values within 1.5 times the interquartile
range (IQR), and points outside the whiskers are outlier predictions.
Legend, tissue types: Breast (6), CNS (7), Colon (6), Haematological (6), Melanoma (9), NSC Lung (10), Ovarian (6), Prostate (2), Renal (8). Legend, drug classes: Chemo-Chemo (226), Chemo-Other (122), Other-Other (9), Targeted-Chemo (190), Targeted-Other (32), Targeted-Targeted (38).
Reprinted from Publication I.
4. Predictive modelling of disease risks
using metabolomic biomarkers
(Publications II-IV)
Publications II-IV collectively aimed to investigate the role of NMR metabolomic biomarkers in predicting disease risks. These studies leveraged
a novel metabolomics dataset from the UK Biobank, the
largest of its kind to date. The scale of this dataset, combined with the
extensive range of health outcome data, facilitated a comprehensive evaluation of the associations between individual metabolomic biomarkers and
disease risk, offering valuable new insights into the broader applicability of these biomarkers for predicting the future onset of various health
outcomes.
4.1 Datasets
Publications II-IV used NMR metabolomic data from three extensive
prospective cohort studies: the UK Biobank, the Estonian Biobank, and
the THL Biobank. The UK Biobank served as the primary dataset for discovery
analyses, while the THL Biobank was used for replication in Publications
III and IV, and the Estonian Biobank for replication in Publication IV.
These cohorts comprise adults from European countries, each with distinct
recruitment methods, blood sampling protocols, timeframes, age distributions, and approaches to capturing outcomes via electronic health records.
Ethical approval was granted by local ethics committees, and all participants provided written informed consent. Details of these datasets are
provided in Publications II-IV and summarized below.
4.1.1 UK Biobank
The UK Biobank is an internationally accessible biomedical database
consisting of half a million participants aged 40–69 at the time of enrollment. Recruitment, which took place between 2006 and 2010, was
conducted on a voluntary basis across 22 assessment centers located in
Scotland, England, and Wales. The study continuously collects follow-up
data through electronic health records, covering hospital admissions, primary care visits, and mortality records, which are regularly updated. The
UK Biobank study received ethical approval from the North West Multi-Centre Research Ethics Committee, and all participants provided written
informed consent. The biomarker profiling of plasma samples using NMR
spectroscopy was approved and data accessed under the approval of UK
Biobank project ID 30418.
4.1.2 THL Biobank
The THL Biobank data used in Publications III–IV comprises five population cohorts (National FINRISK Studies from 1997, 2002, 2007, and
2012, along with the Health 2000 Survey), totaling approximately 35,000
individuals. Each of these cohorts represents a distinct random sample of
individuals aged 25–98 across Finland. Disease outcome data are linked
from national hospital registries and reimbursement records, with follow-up data available until 2017. The THL Biobank cohorts were approved by
the Coordinating Ethical Committee of the Helsinki and Uusimaa Hospital
District, Finland, and the data were accessed under research application
number BB2016_86.
4.1.3 Estonian Biobank
The Estonian Biobank (EBB) comprises around 210,000 individuals, representing approximately 20% of the adult Estonian population. The recruitment was conducted between 2002 and 2022 on a voluntary basis.
The EBB database is routinely updated through integration with multiple
national registries, hospital databases, and the national health insurance
fund’s database, which contains detailed records of treatments and service
billing information. The study was approved by the Estonian Committee
on Bioethics and Human Research, and data access was granted under
research approval number 1.1-12/2770.
4.1.4 NMR metabolomic biomarker profiling
The metabolomic biomarker data in these cohorts were generated by the
NMR platform from Nightingale Health. The platform quantifies 249
metabolomic measures from a blood sample in one experimental assay.
The biomarker panel is designed for accurate quantification in a high-throughput manner, and thus primarily covers molecules with high abundance in the blood. A majority of the biomarkers reflect lipoprotein
metabolism, measuring the lipid concentrations and composition in 14
lipoprotein subclasses. These include total cholesterol, free cholesterol,
cholesterol esters, triglycerides, phospholipids, and total lipid concentration within each subclass. The panel also covers the absolute concentrations
and ratios of abundant fatty acids, glycolysis-related metabolites, and small
molecules, such as ketone bodies and amino acids. Apolipoproteins A1
and B, as well as two inflammatory protein measures, glycoprotein acetyls
and albumin, are also quantified due to their high concentration in blood.
Of these biomarkers, 37 were certified for clinical use
in Europe at the time of Publications II–IV. The metabolomic biomarker
data have been made available through the UK Biobank resource, with
the measurements carried out and data released in three phases.
4.2 Predictive modelling of severe infectious disease and COVID-19
risk using metabolomic biomarkers (Publication II)
The coronavirus disease 2019 (COVID-19) pandemic presented a global
health threat at the time, affecting healthcare systems and societies worldwide. Protecting individuals most vulnerable to severe and potentially
fatal outcomes was central to public health policies, with more stringent
social distancing and other preventive measures recommended for high-risk groups. While older age and chronic health conditions were recognized
as major risk factors 60,61 , the variation in individual susceptibility was
not fully understood. Insights into molecular factors predisposing individuals to severe infectious diseases could help elucidate why certain groups
were at an increased risk. The first wave of COVID-19 swept through
Europe in the spring of 2020, coinciding with the completion of the first
tranche of metabolomic biomarker measurements from the UK Biobank.
This motivated our investigation in Publication II to examine whether
NMR metabolomic biomarkers could be associated with severe outcomes
in infectious diseases, including COVID-19. Given the limited availability
of early COVID-19 data within the UK Biobank, we used hospitalization
for pneumonia as a proxy outcome to derive a risk prediction score, which
was subsequently tested using COVID-19 cases.
4.2.1 Study setting and methodology
Using data from the UK Biobank, we analyzed metabolomic biomarkers
from plasma samples collected between 2006 and 2010, approximately a
decade prior to the onset of the COVID-19 pandemic. Among 92,725 individuals with available COVID-19 data linkage at the time, there were 652
PCR-confirmed positive cases from hospitalized individuals, considered as
severe cases in this study. As the initial data on COVID-19 was limited,
we also included analyses of severe pneumonia outcomes. Among 105,146
participants with complete metabolomic data and no previous diagnosis
of pneumonia, 2,507 pneumonia cases were recorded in hospital or death
registries, inferred as severe cases.
Since all COVID-19 cases occurred nearly a decade after the blood sampling at the beginning of the study, time-resolved outcomes for COVID-19 were not available. Therefore, to ensure methodological consistency
between the pneumonia and COVID-19 analyses, we employed logistic
regression for biomarker association testing and for developing a multi-biomarker risk prediction score. All statistical models were adjusted for
age, sex, and UK Biobank assessment center.
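The associations in this section are reported as odds ratios per 1-SD increment in biomarker concentration. As a minimal sketch of how such a figure follows from a fitted logistic regression coefficient, the biomarker can be standardized before model fitting and the resulting log-odds coefficient exponentiated together with its Wald interval. The function names and the numeric inputs below are illustrative assumptions, not code or estimates from Publication II.

```python
import math
import statistics


def standardize(values):
    """Scale raw concentrations to zero mean and unit SD (z-scores)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def odds_ratio_ci(beta, se, z=1.96):
    """Turn a log-odds coefficient and its standard error into an OR with a Wald 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical coefficient and standard error for a standardized biomarker,
# as they might come out of a covariate-adjusted logistic regression.
beta, se = math.log(1.7), 0.03
or_, lower, upper = odds_ratio_ci(beta, se)
print(f"OR {or_:.1f} (95% CI {lower:.1f}-{upper:.1f})")
```

Because the biomarker enters the model as a z-score, the exponentiated coefficient is directly interpretable as the change in odds per one standard deviation, which makes biomarkers measured in different units comparable.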
4.2.2 Associations of individual metabolomic biomarkers with severe pneumonia and COVID-19
Analysis of the individual biomarkers revealed several significant associations with severe pneumonia. For example, higher levels of omega-6 and
omega-3 fatty acids, cholesterol, branched-chain amino acids, histidine,
and albumin were associated with reduced risk (Figure 4.1). In contrast,
elevated levels of saturated and monounsaturated fatty acids, as well as
glycoprotein acetyls (GlycA), a marker of low-grade inflammation, were
associated with increased risk. A notably similar pattern of associations
was observed for severe COVID-19. Higher levels of omega-3 and omega-6
fatty acids and albumin were linked to protective effects, while elevated
GlycA concentrations were associated with increased risk (Figure 4.2).
Beyond the highly similar overall association patterns for COVID-19 and pneumonia, a novel finding was that
lower concentrations of certain amino acids and fatty acids were associated
with increased risk of both severe pneumonia and severe COVID-19. Collectively, these findings suggest that alterations in metabolomic biomarkers
may reflect an underlying susceptibility to severe outcomes from COVID-19 and pneumonia in the general population. However, the observational
nature of this study limits our ability to establish whether these biomarkers causally contribute to disease risk or simply reflect broader underlying
risk factors. The observed biomarker associations are similar to those
previously reported for all-cause mortality 30 , indicating that these metabolomic profiles may reflect broader frailty rather than being specific to
infectious diseases.
4.2.3 Multi-biomarker score stratifies the risk of severe infectious diseases
Given that the metabolomic biomarkers are measured simultaneously in
a single NMR measurement, we sought to determine whether combining
these biomarkers into a single risk score could capture the risk even more
strongly. In light of the consistent association patterns between severe
pneumonia and COVID-19, as well as their previously reported shared risk
factors 222 , we developed a risk prediction score for severe pneumonia and
evaluated its performance in predicting both pneumonia and COVID-19.
Using 50% of the study population for model training, we applied logistic
regression with LASSO regularization, employing 5-fold cross-validation
to tune the regularization parameter. The resulting risk score, based on a
weighted sum of 25 biomarkers, was termed "infectious disease score" and
subsequently tested in the remaining 50% of the study population.
[Figure 4.1: forest plots in panels for lipoprotein lipids, apolipoproteins, fatty acids, fatty acid ratios, amino acids, glycolysis metabolites, fluid balance, inflammation, and the multi-biomarker score. x-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in biomarker concentration or in infectious disease score. Markers distinguish p value < 0.001 from p value ≥ 0.001.]
Figure 4.1. Associations of metabolomic biomarkers and multi-biomarker infectious disease score to future risk of severe pneumonia, in terms of
odds ratios (N = 105,146; 2,507 events). Associations are shown for 37
clinically validated biomarkers measured by the NMR platform. Odds
ratios represent the change in risk per 1-SD increase in biomarker
levels. Models are adjusted for age, sex, and UK Biobank assessment
center. Horizontal bars represent 95% confidence intervals. Abbreviations: PUFA, polyunsaturated fatty acids; MUFA, monounsaturated
fatty acids; SFA, saturated fatty acid; DHA, docosahexaenoic acid;
BCAA, branched-chain amino acids. Reprinted and modified from
Publication II.
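To illustrate why LASSO regularization yields a score built on a subset of the biomarkers, the following is a minimal from-scratch sketch of L1-penalized logistic regression fitted by proximal gradient descent. It is not the software used in Publication II, which is not specified here; the toy data, the function names, and the two penalty values are hypothetical, and the 5-fold cross-validation used to choose the penalty is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink towards zero, clipping small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient descent.

    X: rows of (roughly) standardized features; y: 0/1 outcomes; lam: penalty strength.
    The intercept is left unpenalized. Returns (weights, intercept).
    """
    n, d = len(X), len(X[0])
    step = 1.0 / (d + 1)  # conservative step size, assuming standardized features
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        residuals = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        grad_b = sum(residuals) / n
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n for j in range(d)]
        b -= step * grad_b
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data (hypothetical): feature 0 tracks the outcome, feature 1 is unrelated to it.
X = [[-1.5, 1], [-1.0, -1], [-0.5, 1], [-0.5, -1], [0.5, -1], [0.5, 1], [1.0, -1], [1.5, 1]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

for lam in (0.5, 0.05):
    weights, intercept = fit_l1_logistic(X, y, lam)
    selected = [j for j, wj in enumerate(weights) if wj != 0.0]
    print(f"lambda={lam}: selected features {selected}")
```

The soft-thresholding step sets weak coefficients to exactly zero, so the fitted model doubles as a variable selector. In practice the penalty strength would be chosen over a grid by cross-validation on the training half, and the resulting weighted sum evaluated only in the held-out half.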
The multi-biomarker infectious disease score demonstrated stronger associations with the severe infectious disease outcomes than any individual
biomarker. For severe pneumonia, the score had an odds ratio (OR) of
1.7 (95% CI 1.6–1.8) per one standard deviation (SD) increment (Figure
4.1). Individuals in the highest quintile of the score had nearly four times
the risk of developing severe pneumonia compared to those in the lowest
quintile (OR 3.8, 95% CI 3.0–4.7). The infectious disease score was also
significantly associated with severe COVID-19 outcomes, with an OR of 1.4
(95% CI 1.3–1.5) per SD increment (Figure 4.2). Individuals in the highest
quintile had almost three times the risk of severe COVID-19 compared
to those in the lowest (OR 2.9, 95% CI 2.1–3.8), despite the decade-long
time lag from blood sampling until COVID-19. Notably, when the analysis of severe pneumonia
was restricted to events occurring 7–11 years after blood sampling, to mirror the time
gap to the COVID-19 pandemic, the
association magnitude for severe pneumonia (OR 2.6, 95% CI 1.7–3.9;
highest vs. lowest quintile) was comparable to that of severe
COVID-19 (Figure 4.3).
[Figure 4.2: forest plots in panels for lipoprotein lipids, apolipoproteins, fatty acids, fatty acid ratios, amino acids, glycolysis metabolites, fluid balance, inflammation, and the multi-biomarker score. x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in biomarker concentration or in infectious disease score. Markers distinguish p value < 0.001 from p value ≥ 0.001.]
Figure 4.2. Associations of metabolomic biomarkers and multi-biomarker infectious disease score to future risk of severe COVID-19, in terms of
odds ratios (N = 92,725; 652 events). Associations are shown for 37
clinically validated biomarkers measured by the NMR platform. Odds
ratios represent the change in risk per 1-SD increase in biomarker
levels. Models are adjusted for age, sex, and UK Biobank assessment
center. Horizontal bars represent 95% confidence intervals. Abbreviations: PUFA, polyunsaturated fatty acids; MUFA, monounsaturated
fatty acids; SFA, saturated fatty acid; DHA, docosahexaenoic acid;
BCAA, branched-chain amino acids. Reprinted and modified from
Publication II.
To effectively screen for susceptibility to severe COVID-19, however,
a strong association with short-term risk would be required, and
such data were not available for COVID-19 in the UK Biobank. When the
analysis of severe pneumonia was limited to events that occurred within
the first two years after blood sampling, the short-term risk increase was
considerably stronger. In this 2-year follow-up analysis, individuals in
the highest quintile of the infectious disease score were over seven times
more likely to develop severe pneumonia compared to those in the lowest
quintile (OR 7.9, 95% CI 4.1–15.6). If a similar short-term risk increase
applied to COVID-19, this score might prove useful in identifying high-risk individuals for severe COVID-19. However, the lack of metabolomic
biomarker data from blood samples taken shortly before the pandemic
prevented us from directly assessing short-term COVID-19 susceptibility.
[Figure 4.3: two forest-plot panels. a) Mimicking the decade lag to the COVID-19 pandemic: within 7 years from blood sampling (943 events) vs. 7–11 years from blood sampling (307 events). b) Mimicking a preventative screening scenario carried out today: within 2 years from blood sampling (162 events) vs. 2–11 years from blood sampling (1088 events). x-axis: odds ratio for severe pneumonia (95% CI), highest vs. lowest quintile of infectious disease score.]
Figure 4.3. Association of the infectious disease score with long-term and short-term risk for severe pneumonia in the UK Biobank test set. a) Odds
ratios for severe pneumonia events that occurred within the first 7
years after the blood sampling, compared with those occurring 7–11 years
after blood sampling. b) Odds ratios for severe pneumonia events
that occurred within the first two years after blood sampling and those
occurring after the first two years. Models are adjusted for age, sex,
and assessment centre. Odds ratios are presented as comparisons
between individuals in the highest and lowest quintiles of the infectious
disease score. Reprinted and modified from Publication II.
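The highest-versus-lowest quintile comparisons reported in this section can be illustrated with a simplified calculation. The sketch below assigns individuals to score quintiles by rank and computes an unadjusted odds ratio with a Woolf (log-based) confidence interval from a 2x2 table. This is a deliberate simplification: the reported estimates come from logistic regression adjusted for age, sex, and assessment center, and the counts used here are hypothetical.

```python
import math


def quintile_labels(scores):
    """Assign each score to a quintile (1 = lowest fifth, 5 = highest fifth) by rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // n + 1
    return labels


def odds_ratio_2x2(cases_high, noncases_high, cases_low, noncases_low, z=1.96):
    """Unadjusted odds ratio (high vs. low group) with a Woolf 95% confidence interval."""
    or_ = (cases_high * noncases_low) / (noncases_high * cases_low)
    se = math.sqrt(1 / cases_high + 1 / noncases_high + 1 / cases_low + 1 / noncases_low)
    return or_, math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)


# Hypothetical counts: 40 cases among 1,000 people in the top quintile,
# 10 cases among 1,000 people in the bottom quintile.
or_, lower, upper = odds_ratio_2x2(40, 960, 10, 990)
print(f"OR {or_:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

With few events in the reference quintile the interval becomes wide, which is consistent with the broad confidence interval seen for the 2-year analysis, where only 162 events were available.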
4.2.4 Conclusions
In conclusion, this study demonstrated that a signature of metabolomic
blood biomarkers, collected a decade prior to the COVID-19 pandemic, is associated with increased susceptibility to severe pneumonia and COVID-19.
This was the first study to show that many of these biomarkers, previously
linked primarily to cardiometabolic diseases, were also associated with
the risk of severe infectious disease outcomes. These findings suggest
that metabolomic biomarkers should not be viewed solely as biomarkers of
cardiometabolic risk but may also reflect the susceptibility to infectious
diseases and potentially other conditions.
The multi-biomarker infectious disease score developed in this study
was particularly highly predictive of severe pneumonia. Notably, the
associations observed with the multi-biomarker score were stronger than
those reported for many pre-existing health conditions, such as diabetes
and obesity 222 . Individuals in the highest quintile of the score had an over
sevenfold increase in short-term pneumonia risk compared to those in
the lowest quintile, providing a strong basis for identifying individuals at
high risk. If a similar short-term risk elevation had applied to COVID-19, metabolomic biomarker profiling could have complemented existing
methods for identifying individuals at high risk of severe COVID-19 as well.
However, the introduction of COVID-19 vaccines rapidly changed the
public health needs at the time these findings were published. Regardless
of the translational applications, these results offer new insights into how
metabolomic biomarkers associate with susceptibility to severe COVID-19
and other infectious disease outcomes.
4.3 Systematic characterization of the associations of metabolomic
biomarkers across common diseases (Publication III)
In Publication III, the aim was to catalogue the associations of NMR
metabolomic biomarkers with the risk of a wide range of common diseases.
This extends beyond the infectious diseases covered in Publication II and
the cardiometabolic focus that has previously prevailed in the research
on these biomarkers. Leveraging the uniquely large sample size and
extensive health outcome data available in the UK Biobank, this study
sought to elucidate the broader associations of these biomarkers on disease
susceptibility, covering over 700 diseases. Prior research had characterized
the associations of the inflammatory biomarker GlycA, one of the most
prominent biomarkers measured by NMR, in relation to over 400 common
diseases in a cohort of 11,861 individuals 144 . The present study, with a
sample size ten times larger, is the first to systematically evaluate all
biomarkers measured by the NMR platform and across many new disease
endpoints.
4.3.1 Study setting and methodology
We systematically analyzed the associations of 249 NMR biomarkers across
717 incident diseases, 648 prevalent diseases, and 77 causes of death.
Disease endpoints were defined using three-character ICD-10 codes (International Classification of Diseases, 10th Revision), derived from hospital
episode statistics and death records. The analysis included all diseases
with at least 50 occurrences within 10 years following blood sampling (for
incident and mortality outcomes) or those documented in prevalent records
up to 25 years prior to sampling. Association testing for incident and mortality outcomes was conducted using Cox proportional hazards regression,
while logistic regression was applied for prevalent disease outcomes. All
models were adjusted for age, sex, and UK Biobank assessment center,
with age used as the time scale in the Cox regression models.
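In our own notation (the symbols below are introduced for illustration and do not appear in Publication III), the two model types can be written as follows for a biomarker b and disease endpoint d, where a denotes attained age, z_b the biomarker value scaled to SD units, and c the covariate vector (sex and assessment center, plus age in the logistic model):

```latex
% Incident and mortality outcomes: Cox model with age as the time scale
h_{d}(a \mid z_b, \mathbf{c}) = h_{0,d}(a)\,
  \exp\!\left(\beta_{b,d}\, z_b + \boldsymbol{\gamma}_{d}^{\top}\mathbf{c}\right),
\qquad
\mathrm{HR}_{b,d} = \exp(\beta_{b,d})

% Prevalent outcomes: logistic regression
\operatorname{logit} \Pr(y_d = 1 \mid z_b, \mathbf{c})
  = \alpha_d + \beta_{b,d}\, z_b + \boldsymbol{\gamma}_{d}^{\top}\mathbf{c},
\qquad
\mathrm{OR}_{b,d} = \exp(\beta_{b,d})
```

Using age as the time scale means that the baseline hazard absorbs the effect of age, so age does not appear as a separate covariate in the Cox model. Such models are commonly fitted with delayed entry at the age of blood sampling; we state this as a typical convention rather than a detail confirmed in the text above.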
4.3.2 Biomarker associations across a broad range of diseases
Among 717 incident disease outcomes, a total of 33,764 significant biomarker
associations were observed at a multiple-testing-corrected significance threshold of p<5e-5. Similarly, 26,035 significant associations were observed for 648 prevalent
diseases, and 3,055 associations for 77 causes of death. Notably, these associations extended beyond cardiometabolic diseases, spanning nearly all
ICD-10 chapters and highlighting metabolomic biomarkers as risk markers for a broad spectrum of conditions, including cancers, mental health
outcomes, and musculoskeletal disorders (Figure 4.4a).
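To put these counts in proportion, the short calculation below relates the number of significant associations to the number of biomarker-endpoint pairs. It assumes that each of the 249 biomarkers was tested against every endpoint, which is our reading of the study design rather than a stated figure; under that assumption roughly 19% of incident, 16% of prevalent, and 16% of mortality pairs were significant.

```python
N_BIOMARKERS = 249

# (number of endpoints, number of significant associations) as reported in the text
reported = {
    "incident": (717, 33_764),
    "prevalent": (648, 26_035),
    "mortality": (77, 3_055),
}


def significant_fraction(n_endpoints, n_significant, n_biomarkers=N_BIOMARKERS):
    """Share of biomarker-endpoint pairs that were significant, assuming a full grid of tests."""
    return n_significant / (n_biomarkers * n_endpoints)


for outcome, (n_endpoints, n_significant) in reported.items():
    share = significant_fraction(n_endpoints, n_significant)
    print(f"{outcome}: {N_BIOMARKERS * n_endpoints} pairs, {share:.1%} significant")
```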
Examining the biomarker associations revealed both broad and disease-specific patterns. For example, GlycA was significantly associated with
32% of the disease endpoints studied, showing a median hazard ratio of
1.26 per 1-SD increment. These associations included conditions such as
gout, type 2 diabetes, kidney diseases, and myocardial infarction (Figure
4.4b). Similarly, the ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA) showed widespread associations across
various diseases (Figure 4.4c). In contrast, certain biomarkers had more
disease-specific patterns, such as alanine, which was primarily linked to
diabetes and its complications (Figure 4.4d). Total branched-chain amino
acids exhibited divergent associations, being positively linked to metabolic
diseases but inversely associated with lung diseases and smoking-related
conditions (Figure 4.4e). A later study was dedicated to exploring
these opposing effects of branched-chain amino acids on health and disease,
including causal assessments 223 .
These findings were further replicated in a meta-analysis of five independent population-based cohorts from Finland, all measured using the same
NMR platform. The replication analysis confirmed consistent associations,
particularly for amino acids, polar metabolites, and fatty acid ratios. Although some deviations were observed in absolute fatty acid measures and
LDL-related biomarkers, the overall concordance between the UK Biobank
and the Finnish cohorts demonstrates the robustness and transferability
of the findings.
4.3.3 Insights into shared biomarker signatures
Through systematic characterization of biomarker associations across
diseases, we obtained “biomarker signatures” that capture the unique
patterns of biomarker associations for each disease. We then performed
clustering analyses of these biomarker signatures to gain insights into
differences and similarities between the metabolomic association profiles of
diseases. For instance, type 2 diabetes exhibited highly similar biomarker
patterns of association with several of its complications, such as retinal
disorders and polyneuropathies. In contrast, diseases like pneumonia
and bacterial infections, as well as chronic obstructive pulmonary disease
(COPD) and lung cancer, clustered together.
Figure 4.4. Biomarker associations for incident disease outcomes across a range
of diseases. a) Total number of significant associations for each
biomarker at a statistical significance threshold of p<5e-5. The colour
coding denotes the proportion of significant associations by ICD-10
chapters. b–e) Twenty most significant associations for four selected
biomarkers: b) glycoprotein acetyls (GlycA), c) ratio of polyunsaturated
fatty acids to monounsaturated fatty acids (PUFA/MUFA), d) alanine,
and e) branched-chain amino acids (BCAA). The associations are
ranked by descending absolute effect size, with each association
represented by hazard ratios (HRs) and 95% confidence intervals (CI)
per standard deviation increase in biomarker concentration. All models
are adjusted for sex and UK Biobank assessment centre, using age
as the time scale in Cox regression. Reprinted from Publication III.
Figure 4.5. Example comparison of the biomarker signatures for the incidence of
acute myocardial infarction (ICD-10 code I21) and heart failure (ICD-10 code I50), here illustrated for 37 biomarkers from the NMR platform
certified for clinical use. Hazard ratios (HRs) for the biomarkers are
represented as individual points, with corresponding 95% confidence
intervals (CI) displayed through vertical and horizontal error bars.
Reprinted from Publication III.
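One simple way to compare biomarker signatures between diseases is to treat each signature as a vector of log hazard ratios and compute pairwise correlations, which can then feed a clustering algorithm. The sketch below does this for three diseases; the similarity measure is our assumption (the clustering method is not detailed in this section), and all numbers are made up for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long signature vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def most_similar(name, signatures):
    """Return the disease whose signature correlates most strongly with that of `name`."""
    others = (other for other in signatures if other != name)
    return max(others, key=lambda other: pearson(signatures[name], signatures[other]))


# Hypothetical log hazard ratios for four biomarkers per disease (made-up numbers).
signatures = {
    "type 2 diabetes": [0.50, 0.30, -0.20, 0.10],
    "retinal disorders": [0.40, 0.25, -0.15, 0.12],
    "COPD": [-0.10, 0.20, 0.30, -0.30],
}
print(most_similar("type 2 diabetes", signatures))
```

Correlation compares the shape of the signatures rather than their magnitude, so a disease and its complication can appear similar even when the complication shows uniformly weaker associations.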
However, differences were also observed between seemingly related conditions. For instance, different types of heart disease, such as acute myocardial infarction and heart failure, which are often grouped together in
composite endpoints for risk prediction, had notably different biomarker associations (Figure 4.5). Further analysis revealed that different subtypes of
cardiovascular disease, including various types of chronic ischaemic heart
disease, myocardial infarction, and different types of stroke, had distinct
association patterns. Interestingly, the biomarkers often had stronger
associations with other circulatory diseases than with ischemic stroke and
myocardial infarction, suggesting that separating these endpoints could
potentially improve risk prediction.
4.3.4 Accounting for the effects of lipid-lowering medications
To account for the potential impact of lipid-lowering medications, we conducted additional analyses excluding individuals using these treatments.
However, removing these participants could introduce selection bias by
disproportionately excluding individuals with specific health profiles. To
address this limitation, we replicated the analyses in the FINRISK 1997
cohort, where the prevalence of cholesterol-lowering medication use was
much lower due to sampling occurring before the widespread adoption of
statins for primary prevention. Most biomarker associations remained
consistent across the two cohorts, including the lack of association between
LDL cholesterol and major adverse cardiovascular events (MACE). The
inverse associations of LDL cholesterol with chronic kidney failure and
all-cause mortality were also replicated.
To further evaluate the effects of lipid-lowering treatments, we stratified
the analyses by age tertiles. Because the use of cholesterol-lowering and
other medications increases with age, younger age groups are less likely to
be influenced by these sources of bias. The age-stratified analysis revealed
stronger biomarker associations in the younger age tertiles for several
biomarkers. Notably, LDL-related biomarkers showed weaker associations
in older individuals, with some association magnitudes even reversing
direction in non-circulatory diseases, likely due to the higher prevalence
of statin use in older populations. Inflammatory biomarkers and several
amino acids, which are not affected by lipid-lowering treatments, also
displayed stronger associations in younger participants. This suggests that
the observed age-related differences are not solely attributable to statin
use but may reflect broader age-related changes in biomarker-disease
associations.
4.3.5 Conclusions
This study highlighted widespread associations of metabolomic biomarkers
with a broad range of diseases, indicating that the metabolomic biomarkers
are risk markers beyond cardiometabolic diseases. These findings reinforce
earlier reports linking metabolomic biomarkers to all-cause mortality and
multimorbidity 29,30 , as many biomarkers were found to be associated with
leading causes of mortality and morbidity. While inflammatory biomarkers
like GlycA have been previously linked to various diseases 143,144 , this study
for the first time extends these findings to circulating fatty acids, amino
acids, and detailed lipoprotein measures. The widespread associations of
biomarkers like GlycA and MUFA% demonstrate their potential to inform
on the risk of multi-disease outcomes.
The results from this study have been compiled into a publicly available
biomarker-disease atlas, an interactive web tool that allows users to visualize association results and download summary statistics. We anticipate
that this resource will serve as a valuable foundation for further research,
facilitating deeper investigations beyond biomarker discovery.
4.4 Metabolomic and genomic prediction of common diseases
(Publication IV)
Building upon the strong associations between metabolomic biomarkers
and various diseases established in Publication III, along with the promising multi-biomarker prediction results from Publication II, Publication
IV aimed to evaluate the predictive performance of metabolomic multi-biomarker risk scores for leading chronic diseases. A primary novelty of
this study was the assessment of the transferability of these risk scores
across multiple large population cohorts. Additionally, the study compared the predictive power of metabolomic risk scores against other data
sources, such as polygenic risk scores and established clinical risk factors.
Furthermore, the study included a subset of individuals with biomarker
measurements taken at two time points to assess whether changes in
biomarker profiles over time could alter future disease risks. This research
was a collaborative effort involving many scientists and extensive work in
creating and curating the data resources. However, in this section, I will
focus on my primary contributions, which involved the development of risk
prediction models and the evaluation of their performance and calibration
across multiple biobanks.
4.4.1 Study setting and methodology
This study leveraged NMR metabolomic biomarker profiles from three major population biobanks: the UK Biobank, Estonian Biobank, and Finnish
THL Biobank, comprising 700,217 individuals with extensive health outcome data. In this study, we focused on 12 leading causes of morbidity
in the WHO European region, which together account for over one-third
of total disability-adjusted life years (DALYs) in this population. Disease outcomes were defined as the four-year incidence of these conditions,
accommodating the shorter follow-up period available in the Estonian
Biobank.
To derive metabolomic risk scores for predicting disease incidence, regularized Cox proportional hazards regression was used to train the models
in one half of the UK Biobank population. Age and sex were included as fixed
covariates, and LASSO regularization was applied for variable selection
among the 36 metabolomic biomarkers certified for clinical use, using 5-fold
cross-validation to tune the regularization parameters. The performance
of these metabolomic risk scores was then tested in the remaining half of
the UK Biobank and externally validated using data from the Estonian
and THL biobanks. Importantly, since the biomarkers are measured in
absolute concentration units (e.g. g/L or mmol/L), the metabolomic risk
scores derived from the UK Biobank could be directly applied to the other
cohorts without requiring cohort-specific normalization of the biomarker
values. This is a notable distinction from typical ’omics analyses, where
such normalization is often necessary to account for varying measurement
scales, platforms, and batch effects 224,225 .
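As a sketch of the model form described above, written in our own notation rather than that of Publication IV, the LASSO-regularized Cox regression can be expressed as a penalized partial likelihood in which only the biomarker coefficients are shrunk:

```latex
(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\gamma}})
  = \arg\max_{\boldsymbol{\beta},\, \boldsymbol{\gamma}}
    \left\{ \ell(\boldsymbol{\beta}, \boldsymbol{\gamma})
            - \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
S_i = \sum_{j=1}^{p} \hat{\beta}_j\, x_{ij}
```

Here \ell is the Cox partial log-likelihood, \beta_j are the coefficients of the p candidate biomarkers, \gamma are the coefficients of age and sex (assumed here to be unpenalized, which is our interpretation of "fixed covariates"), \lambda is the penalty tuned by 5-fold cross-validation, and x_{ij} is the value of biomarker j for individual i. Because x_{ij} is expressed in absolute concentration units, the score S_i can be computed in a new cohort using the coefficients estimated in the UK Biobank training half, without re-estimating any cohort-specific scaling.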
4.4.2 Metabolomic risk scores stratify disease risk
The metabolomic risk scores effectively stratified disease risk across all
12 conditions, demonstrating a clear increase in event rates with higher
risk score percentiles (Figure 4.6). The risk increase was particularly
steep in the upper tail of the score distribution, most notably
for type 2 diabetes, alcoholic liver disease, and liver fibrosis
and cirrhosis. In terms of discrimination, the majority of the metabolomic
risk scores demonstrated good performance across the studied diseases,
with area under the ROC curve (AUC) ranging from 0.70 to 0.95, except
for depression, which had a lower AUC.
The models were further evaluated to illustrate their utility in a practical risk prediction scenario. Specifically, we established a threshold
corresponding to the top decile of risk scores in the UK Biobank training set. Individuals in other cohorts who exceeded this threshold were
designated as the high-risk group and compared against the remaining
population. Across all three biobanks, this high-risk group consistently
exhibited elevated disease risks, as indicated by hazard ratios (Figure
4.7). Exceptions included alcoholic liver disease and depression, which
demonstrated statistically significant heterogeneity in the meta-analysis
(Cochran’s Q-test, multiple testing corrected p<0.004). The meta-analysis
comparing this high-risk group to the remaining population revealed hazard ratios of approximately 10 for liver disease and diabetes, around 4
for lung cancer and chronic obstructive pulmonary disease (COPD), and
around 2.5 for myocardial infarction, stroke, and vascular dementia. Notably, the UK Biobank showed the highest effect sizes for only 4 out of
the 12 diseases, indicating that the metabolomic risk scores and their
associations are generalizable across study populations rather than being
overfitted to specific characteristics of the UK Biobank training data.
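The high-risk definition described above fixes a threshold in the training data and reuses it unchanged in other cohorts. The following minimal sketch shows that logic; the function names and the scores are hypothetical, and the percentile rule used here (the smallest score with at least 90% of training scores at or below it) is one of several reasonable conventions.

```python
def top_decile_threshold(train_scores):
    """Smallest training-set score that at least 90% of training scores do not exceed."""
    ordered = sorted(train_scores)
    k = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n) using integer arithmetic
    return ordered[k - 1]


def flag_high_risk(scores, threshold):
    """Mark individuals whose score exceeds a threshold fixed in the training cohort."""
    return [s > threshold for s in scores]


# Hypothetical scores: the threshold is set once on training data and reused elsewhere.
train_scores = list(range(1, 101))
threshold = top_decile_threshold(train_scores)
external_scores = [85, 95, 90, 100]
print(threshold, flag_high_risk(external_scores, threshold))
```

Because the threshold is not re-derived per cohort, the share of individuals flagged as high risk in an external cohort can differ from 10%, which reflects how such a score would behave if deployed with a fixed cut-off.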
Translating risk prediction models from biobank research into clinical
practice requires both robust discrimination and calibration. Calibration
was assessed by comparing observed and predicted event rates across
deciles within each biobank and computing the corresponding calibration
slopes. Generally, the metabolomic risk scores reflected good calibration. In
the UK Biobank test set, calibration slopes ranged from 0.95 to 1.24 across
[Figure 4.6: twelve panels, one per disease (myocardial infarction, ischemic stroke, intracerebral hemorrhage, lung cancer, type 2 diabetes, chronic obstructive pulmonary disease, Alzheimer's disease, vascular and other dementia, depressive disorders, alcoholic liver disease, cirrhosis of the liver, colon and rectum cancers). x-axis: percentile of metabolomic score (0 to 100); y-axis: incidence (%).]
Figure 4.6. Observed incidence of the 12 diseases across one-percent bins of
the metabolomic risk score, calculated as a sample size-weighted
mean of the 4-year incidence across the three biobank cohorts (N =
481,678). The red shaded area represents the top 10% of age and sex
adjusted metabolomic risk score. The horizontal dashed line indicates
overall population incidence. Reprinted from Publication IV.
diseases, which is expected since the models were trained using data from
the same biobank. In the Estonian Biobank, calibration slopes ranged from
0.76 to 1.16, except for depression, which exhibited a lower slope of 0.42,
likely reflecting differences in diagnostic criteria and recording practices
between countries. In the THL Biobank, calibration slopes ranged from
1.03 to 1.21.
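To make the decile-based calibration check concrete, the following sketch bins individuals into risk deciles, compares mean predicted and observed event rates per decile, and fits a straight line through the ten points. The exact slope estimator used in Publication IV is not specified here, so the least-squares fit via `np.polyfit`, the function name, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def decile_calibration_slope(predicted_risk, observed_event):
    """Slope of observed vs. predicted event rates across risk deciles."""
    predicted_risk = np.asarray(predicted_risk, dtype=float)
    observed_event = np.asarray(observed_event, dtype=float)
    order = np.argsort(predicted_risk)
    bins = np.array_split(order, 10)                 # ten equally sized risk groups
    mean_pred = np.array([predicted_risk[b].mean() for b in bins])
    mean_obs = np.array([observed_event[b].mean() for b in bins])
    slope, intercept = np.polyfit(mean_pred, mean_obs, deg=1)
    return slope, mean_pred, mean_obs

# Synthetic example: events drawn from the predicted risks themselves,
# so the slope should be close to 1 up to sampling noise.
rng = np.random.default_rng(1)
risk = rng.uniform(0.01, 0.30, size=50_000)
events = rng.binomial(1, risk)
slope, mean_pred, mean_obs = decile_calibration_slope(risk, events)
```

A slope near 1 indicates that observed event rates rise in step with predicted rates across deciles; values further from 1, such as the 0.42 reported for depression in the Estonian Biobank, indicate that the spread of predicted risks does not match the spread of observed rates.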
These results suggest that the metabolomic risk scores achieve a reasonable degree of calibration across diverse population cohorts, with some
variability influenced by cohort-specific factors. Differences in participant
recruitment strategies, such as the volunteer-based enrollment in the UK
Biobank and Estonian Biobank, may contribute to these inconsistencies.
Nonetheless, this level of calibration is comparable to widely used clinical tools. For instance, the Pooled Cohort Equations for cardiovascular
risk assessments have shown similar calibration when applied across different populations, such as between the US and Canada [226]. Similarly,
the QRISK3 cardiovascular risk model, widely implemented in the UK,
has demonstrated comparable calibration performance when applied to
[Figure 4.7: forest plot with one row per disease (the same 12 diseases as in Figure 4.6), showing estimates for the meta-analysis, UK Biobank, THL Biobank and Estonian Biobank. x-axis (ticks at 1, 3, 5, 10, 30): hazard ratio (95% CI), highest risk decile vs. remaining population.]
Figure 4.7. Hazard ratios for metabolomic risk scores across the 12 studied diseases, comparing individuals in the highest risk decile to the rest of
the population. Results are presented for three biobanks: UK Biobank
(purple), THL Biobank (teal), and Estonian Biobank (orange) (N =
481,678). Meta-analysis of these results from the three biobanks is
displayed in black. Filled circles indicate statistically significant associations (p<0.004), while hollow circles represent non-significant
associations. Horizontal error bars show the 95% confidence intervals
(CI). Reprinted from Publication IV.
external cohorts like the UK Biobank [227].
4.4.3 Conclusions
This study highlights the potential of NMR metabolomic biomarkers in
stratifying the risk for multiple chronic diseases. By integrating data from
three large population biobanks, comprising over 700,000 individuals, we
developed and validated metabolomic risk scores for 12 major diseases
that contribute substantially to the burden of chronic diseases. Notably,
the ability to apply these risk scores across different biobanks without
cohort-specific normalization demonstrates robustness and transferability of the models. The consistent calibration and predictive performance
observed across cohorts, despite differences in age ranges, fasting protocols and disease prevalence, indicate the potential utility of these risk
scores in diverse real-world settings. These findings further position metabolomic biomarkers as promising tools for comprehensive multi-disease
risk assessment.
5. Machine learning with comprehensive interaction modelling for disease risk prediction (Publication V)
In Publication V, we aimed to address the challenge of estimating comprehensive interaction effects among predictors in time-to-event prediction
models. We introduced survivalFM, a novel machine learning extension
to Cox proportional hazards regression. Unlike standard Cox regression,
which assumes linear effects and requires pre-specified interaction terms,
survivalFM automatically estimates all potential interaction effects among
predictor variables. This method builds on the concept of factorizing the interaction parameters from factorization machines (FM). This concept was
successfully applied in Publication I for predicting drug combination responses and is here applied for the first time in the context of survival analysis.
While survivalFM can be used for modelling any time-to-event outcome,
in Publication V, we highlight its applicability in disease risk prediction,
using data from the UK Biobank.
5.1 Foundations of survivalFM
survivalFM is an extension of the widely used Cox proportional hazards
model [32], which relates time-to-event outcomes to a set of predictor variables through a hazard function defined as:
\[ h(t \mid x) = h_0(t) \exp(f(x)) \tag{5.1} \]
where $h_0(t)$ is the baseline hazard and $\exp(f(x))$ is the partial hazard.
In the standard formulation, the partial hazard is parameterized by a
linear combination of predictor variables $f(x) = \beta^\top x$. survivalFM extends
this formulation by incorporating an estimation of all pairwise interaction
effects through a factorized parametrization (Figure 5.1a-b):
\[ f(x) = \beta^\top x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle \, x_i x_j \tag{5.2} \]
where $\langle \cdot, \cdot \rangle$ denotes the inner product, and $d$ represents the number
of predictor variables. The first part captures the linear effects of the
predictors, similar to the standard Cox regression model. The second part
captures the pairwise interaction effects between all predictors $x_i$ and $x_j$.
Rather than directly estimating the interaction terms $\beta_{ij}$, the factorized
parameterization approximates these effects using an inner product of two
low-rank latent vectors, $\tilde{\beta}_{ij} = \langle p_i, p_j \rangle$. This approach substantially reduces
the number of parameters to be estimated, as the rank of the factorization
is typically much lower than the total number of predictors ($k \ll d$). Further
details of this factorization approach are described in Section 2.5.2.
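As a minimal sketch of Equation 5.2 (not the survivalFM implementation itself), the predictor $f(x)$ can be evaluated without looping over all pairs by using the standard factorization machine identity $\sum_{i \ne j} \langle p_i, p_j \rangle x_i x_j = \lVert \sum_i p_i x_i \rVert^2 - \sum_i \lVert p_i \rVert^2 x_i^2$. The function names and the random test values below are assumptions introduced for the example; the sum is taken over ordered pairs $i \ne j$, exactly as Equation 5.2 is written.

```python
import numpy as np

def fm_predictor(x, beta, P):
    """f(x) = beta^T x + sum_{i != j} <p_i, p_j> x_i x_j, in O(d * k) time."""
    linear = beta @ x
    z = P.T @ x                                   # sum_i p_i x_i, shape (k,)
    diagonal = np.sum((P ** 2).T @ (x ** 2))      # sum_i ||p_i||^2 x_i^2
    return linear + (z @ z - diagonal)

def fm_predictor_naive(x, beta, P):
    """Same quantity via an explicit double loop, for checking only."""
    d = len(x)
    interactions = sum(
        (P[i] @ P[j]) * x[i] * x[j]
        for i in range(d) for j in range(d) if i != j
    )
    return beta @ x + interactions

rng = np.random.default_rng(0)
d, k = 6, 2                                       # hypothetical sizes with k < d
x, beta, P = rng.normal(size=d), rng.normal(size=d), rng.normal(size=(d, k))
fast, slow = fm_predictor(x, beta, P), fm_predictor_naive(x, beta, P)
```

With this parametrization the model stores $d + dk$ parameters rather than $d + d(d-1)/2$ for separately estimated pairwise terms, which is where the saving for $k \ll d$ comes from.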
This method avoids the statistical and computational challenges associated with directly estimating interaction terms when numerous predictor
variables are present. Coupled with an efficient quasi-Newton optimization algorithm (BFGS; Broyden–Fletcher–Goldfarb–Shanno [228–231]), this
method facilitates comprehensive modeling of interaction effects even with
many predictor variables. Notably, unlike many other advanced machine
learning methods for survival analysis, this approach preserves the interpretability of the underlying model through access to the estimated effects
of the individual predictors and their interactions.
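The objective that such an optimizer would minimize can be sketched as the negative log partial likelihood of the Cox model, evaluated with the factorized predictor of Equation 5.2 and augmented with L2 penalties. The sketch below is an assumption-laden illustration rather than the published implementation: it ignores tied event times, uses separate hypothetical penalty weights `lam_beta` and `lam_P`, and all function names are invented for the example.

```python
import numpy as np

def fm_scores(X, beta, P):
    """f(x) from Eq. 5.2 for every row of X (ordered pairs i != j)."""
    Z = X @ P                                          # (n, k): sum_i p_i x_i per row
    interactions = np.sum(Z ** 2, axis=1) - (X ** 2) @ np.sum(P ** 2, axis=1)
    return X @ beta + interactions

def penalized_loss(params, X, time, event, k, lam_beta=0.0, lam_P=0.0):
    """Negative log partial likelihood (no tied times) plus L2 penalties."""
    d = X.shape[1]
    beta, P = params[:d], params[d:].reshape(d, k)
    f = fm_scores(X, beta, P)
    order = np.argsort(-time)                          # longest follow-up first
    f_sorted, event_sorted = f[order], event[order]
    # Running log-sum-exp gives log sum_{j: t_j >= t_i} exp(f_j) for each i.
    log_risk_set = np.logaddexp.accumulate(f_sorted)
    nll = -np.sum((f_sorted - log_risk_set)[event_sorted == 1])
    return nll + lam_beta * np.sum(beta ** 2) + lam_P * np.sum(P ** 2)

# Toy data: 4 individuals, 2 predictors, rank-1 factorization (all hypothetical).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 2.0]])
time = np.array([5.0, 3.0, 4.0, 1.0])
event = np.array([1, 0, 1, 1])
loss_at_zero = penalized_loss(np.zeros(2 + 2 * 1), X, time, event, k=1)
```

A function of this form takes a flat parameter vector and returns a scalar, so it could be handed to a general-purpose quasi-Newton routine, for example `scipy.optimize.minimize` with `method="BFGS"`; whether the published software does so in this way is not stated here.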
5.2 Study population and evaluation settings
To assess whether survivalFM could enhance disease risk prediction performance and offer insights into risk factor interactions, we conducted analyses using data from the UK Biobank. We focused on the 10-year incidence
of ten selected diseases as our outcomes of interest. To evaluate model
performance across varying data sources, we designed four prediction
scenarios incorporating a range of predictors, from standard clinical variables to ’omics-based data sources, including biochemistry measures, blood
counts, metabolomic biomarkers, and polygenic risk scores. To determine
the benefits of survivalFM in capturing complex interaction effects, we
compared its performance to that of standard Cox proportional hazards
regression (Figure 5.1b), applying L2 regularization in both methods to
manage complexity and mitigate overfitting. Model performance was assessed using 10-fold cross-validation, with a 20% validation set in each
fold used to optimize the regularization parameters.
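One way to read this evaluation scheme is sketched below: the data are split into ten folds, each fold serves once as the test set, and 20% of the remaining data is held out as a validation set for choosing the regularization parameters. Taking the 20% from the non-test portion of each fold is an interpretation, and the function name and seed handling are assumptions for the example.

```python
import numpy as np

def cv_splits(n_samples, n_folds=10, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index arrays for each fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i, test_idx in enumerate(folds):
        # Everything outside the current test fold is available for fitting.
        rest = np.concatenate([f for j, f in enumerate(folds) if j != i])
        rest = rng.permutation(rest)
        n_val = int(round(val_fraction * len(rest)))
        # Validation part is used only to tune the regularization strength.
        yield rest[n_val:], rest[:n_val], test_idx

splits = list(cv_splits(100))
```

Each model would then be fitted on the training indices for a grid of regularization values, the value with the best validation performance selected, and the final performance reported on the untouched test indices.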
5.3 Improved prediction of disease risk across various settings
We demonstrated that survivalFM is capable of identifying predictive interaction terms and improving risk prediction accuracy (Figure 5.2). In
terms of discrimination, measured by the concordance index, survivalFM
showed statistically significant improvements in 26 out of the 40 evaluated
scenarios (65%), with an average increase in concordance index (ΔC-index)
[Figure 5.1 panels: a) Comprehensive interaction modelling by machine learning: survivalFM. Linear effects $\beta \in \mathbb{R}^{d}$; all potential interaction effects $\beta \in \mathbb{R}^{d \times d}$; factorized parametrization $\beta_{i,j} \approx \langle p_i, p_j \rangle$ with a low-rank parameter matrix of the factor vectors $P \in \mathbb{R}^{d \times k}$. b) Method evaluation: standard Cox regression (time to event ~ $\beta^\top x$) compared with survivalFM (time to event ~ $\beta^\top x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle x_i x_j$). Disease prediction examples in the UK Biobank: i) case studies with various data modalities; ii) clinical example, QRISK3.]
Figure 5.1. Overview of the method. a) A machine learning method for survival
analysis, survivalFM, is designed to estimate linear and all pairwise
interaction effects among predictors using a factorized parametrization
of the interaction effects $\langle p_i, p_j \rangle$. $d$ denotes the number of predictors
and $k$ is a hyperparameter that defines the factorization rank for the
interaction terms. The rank of the factorization is typically much lower
than the number of predictor variables ($k \ll d$), enabling computation
of the interaction terms even in the presence of many predictors.
b) The benefits of integrating comprehensive interaction terms via
survivalFM are evaluated by comparing the prediction performance
to the standard linear Cox regression. Reprinted and modified from
Publication V.
of 0.005. In terms of continuous net reclassification improvement (NRI),
survivalFM significantly improved risk reclassification in 39 out of 40
scenarios (97.5%), yielding an average continuous NRI of 37%. Hence,
despite the relatively modest gains in C-indices, the substantial improvement in continuous NRI suggests notable improvements in individual risk
predictions.
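For reference, the concordance index used above as the discrimination measure can be sketched as follows. This is a plain pairwise implementation of Harrell's C-index written for clarity rather than speed; it is not the evaluation code of Publication V, and the function name and toy values are assumptions.

```python
def concordance_index(time, event, risk):
    """Harrell's C: share of comparable pairs ranked correctly by risk."""
    concordant = tied = comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                      # pairs are anchored on observed events
        for j in range(n):
            if time[j] > time[i]:         # j was still event-free when i had the event
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable

# Toy data (hypothetical): the third individual is censored at time 3.
time = [1, 2, 3, 4]
event = [1, 1, 0, 1]
risk_a = [0.9, 0.7, 0.8, 0.1]             # predicted risks from a baseline model
risk_b = [0.9, 0.8, 0.7, 0.1]             # predicted risks from an extended model
delta_c = concordance_index(time, event, risk_b) - concordance_index(time, event, risk_a)
```

The ΔC-index is the difference between two such values computed on the same individuals. Because it depends only on the ordering of pairs, it can change little even when many individual predicted risks move, which is consistent with the contrast between the modest ΔC-index and the larger continuous NRI noted above.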
A major advantage of survivalFM is its ability to introduce non-linearity
by incorporating comprehensive interaction terms, while still maintaining
interpretability. It achieves this by providing estimates for both main
effects and interactions. Our analysis of these interaction effects revealed
that numerous small interactions often jointly improved prediction accuracy, underscoring the value of capturing the complete interaction structure.
Moreover, we demonstrated that capturing these interaction effects
generally requires large sample sizes, with survivalFM showing an increasing performance advantage over standard Cox regression as the sample
size grows.
Figure 5.2. Comparison of the predictive performance of survivalFM against standard linear Cox proportional hazards regression, shown in terms of
differences in concordance index (ΔC-index) and continuous net
reclassification improvement (NRI). Results are presented for ten disease outcomes (y-axis), considering four different sources of data: a)
standard risk factors (blue; included in all models), b) clinical biochemistry and blood counts (red), c) metabolomic biomarkers (orange) and
d) polygenic risk scores (green). Horizontal error bars represent 95%
confidence intervals (CIs), estimated using bootstrapping with 1000
resamples. Reprinted from Publication V.
5.4 Enhanced cardiovascular risk prediction performance
We further demonstrated that survivalFM can enhance prediction accuracy
in a practical scenario for cardiovascular disease risk prediction. Using
predictors from the widely implemented QRISK3 model [20], recommended
by the UK clinical guidelines [88], we compared three models of increasing
complexity: (1) a standard Cox regression model with linear terms, (2) a
Cox regression model including linear terms and age interaction terms
currently included in QRISK3, and (3) a survivalFM model that includes
linear terms along with comprehensive interaction effects.
In terms of discrimination, survivalFM showed a statistically significant improvement with a ΔC-index of 0.0018 (95% CI 0.0013–0.0023)
compared to the standard linear model and a ΔC-index of 0.0014 (95% CI
0.0010–0.0019) compared to the model that also incorporates the QRISK3 age
interactions (Figure 5.3a). In terms of categorical reclassification at a clinically established 10% risk threshold [88], adding the QRISK3 age interaction
terms provided an overall net reclassification improvement (NRI) of 0.66%
(95% CI: 0.40%–0.93%) over the linear model, while survivalFM achieved
a greater NRI of 1.47% (95% CI: 1.12%–1.82%) (Figure 5.3b-c). In terms of
reclassification, survivalFM thus more than doubles the gain
obtained with the currently included
QRISK3 age-interaction terms alone.
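The categorical NRI at a fixed risk threshold can be illustrated with the short sketch below. It assumes that 10-year event status is known for everyone (no censoring) and that a predicted risk equal to the threshold counts as high risk; both are simplifications for the example, and the function name and toy values are hypothetical rather than taken from Publication V.

```python
import numpy as np

def categorical_nri(risk_old, risk_new, event, threshold=0.10):
    """Net reclassification improvement across a single risk threshold."""
    risk_old, risk_new = np.asarray(risk_old), np.asarray(risk_new)
    is_event = np.asarray(event) == 1
    old_high, new_high = risk_old >= threshold, risk_new >= threshold
    moved_up = ~old_high & new_high        # reclassified from low to high risk
    moved_down = old_high & ~new_high      # reclassified from high to low risk
    # Events should move up; non-events should move down.
    nri_events = moved_up[is_event].mean() - moved_down[is_event].mean()
    nri_nonevents = moved_down[~is_event].mean() - moved_up[~is_event].mean()
    return nri_events + nri_nonevents, nri_events, nri_nonevents

# Toy data (hypothetical): two events and two non-events.
risk_linear = [0.05, 0.12, 0.08, 0.20]
risk_extended = [0.11, 0.09, 0.09, 0.25]
had_event = [1, 0, 0, 1]
nri, nri_ev, nri_nonev = categorical_nri(risk_linear, risk_extended, had_event)
```

The continuous NRI reported in Section 5.3 follows the same logic but counts any increase or decrease in predicted risk, rather than only movements across the threshold, which is why its values are typically much larger.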
5.5 Conclusions
In conclusion, Publication V introduced survivalFM, a method that extends
Cox regression by adding an estimation of all pairwise interaction effects on
time-to-event outcomes. We showed that incorporating these comprehensive interaction effects improves risk prediction performance across various
common diseases and sources of data. Notably, these improvements were
achieved by optimizing the use of existing predictors, without the need
for additional data types. In contrast to many advanced machine learning techniques in survival analysis, a notable advantage of survivalFM
is that it retains interpretability by providing estimated effects for both
individual predictors and their interactions. In Publication V, we further
discuss some of the identified interactions and diseases where survivalFM
showed the greatest benefits. Given its generalizability to other contexts,
survivalFM is expected to find use cases in precision medicine and improve
risk prediction model development.
Figure 5.3. Evaluation of the performance of survivalFM in a CVD risk prediction
scenario involving predictors from QRISK3 (N = 344 292 with complete
data, 21 534 events). a) Discrimination performance assessed using
concordance index (C-index) across three models: (1) standard Cox
regression with linear terms, (2) standard Cox regression with linear
terms and age interaction terms from QRISK3, and (3) survivalFM
model incorporating linear terms along with all factorized pairwise
interactions. b) Categorical net reclassification improvement (NRI)
at a 10% absolute risk threshold, computed relative to the standard Cox
model with linear terms. Horizontal error bars indicate 95% confidence
intervals (CIs), estimated via bootstrapping with 1000 resamples. c)
Reclassification plots illustrating how including interaction terms in the
more advanced models modifies individual risk predictions. Reprinted
from Publication V.
6. Concluding remarks
In this dissertation, we have developed computational methods and uncovered novel biological insights that contribute to various aspects of precision
medicine, including predicting the effects of drug combination treatments,
utilizing metabolomic biomarkers in disease risk assessment, and enhancing methodological approaches for disease risk prediction.
Given the vast number of possible drug combinations, computational
methods are essential for guiding experimental research by prioritizing the
most promising combinations for further validation [58,59,178]. In Publication
I, we introduced comboFM, a novel machine learning framework designed
to predict dose-specific responses to drug combinations. Leveraging data
from a large drug combination cancer cell line screen, we showed that comboFM consistently demonstrates accurate prediction performance across
multiple tissue types and drug classes in diverse prediction scenarios. Importantly, it was able to generalize predictions to new drug combinations
not observed during training, offering valuable insights for repositioning
existing drugs into new combinations in cancer treatment. Experimental
validation of previously untested combinations further confirmed the robustness of these predictions, highlighting the potential of comboFM to
advance the development of effective combination therapies in precision
oncology. Moreover, many drugs predicted by comboFM were found to
be currently evaluated in clinical trials against the specific cancer types,
either as single agents or in combinations with other drugs, highlighting
its translational potential.
Looking forward, several research avenues can expand upon the capabilities of comboFM. While the current work validated it using cancer
cell lines, the framework could be adapted to patient-derived samples as
such data become more widely available [232–234]. Given that the input data
required for comboFM are becoming routinely accessible in functional
precision medicine studies, the framework holds potential for broad applicability across various cancer types and therapeutic contexts. However, as
with many molecular profiling technologies, challenges remain in ensuring
that the predictions are reliable across different experimental assay conditions and biological contexts. Therefore, future work should validate these
models using diverse and well-standardized cell line and patient-derived
datasets to support the wider application of comboFM. Additionally, comboFM has already inspired further methodological advancements, such
as comboLTR [235], which extends the current framework by removing the
assumption of polynomial symmetry through the use of a latent tensor
reconstruction (LTR) technique.
Beyond treatment strategies, this dissertation also explored risk prediction aspects of precision medicine. Comprehensive biomarker profiling
can provide means for simultaneous risk stratification across multiple
diseases. One promising approach involves the use of NMR metabolomics
to profile small molecules and lipids in blood samples. Publications II–IV
leveraged uniquely large, population-scale NMR metabolomics datasets
to identify novel biomarkers and establish associations of metabolomic
biomarkers across a wide range of health outcomes, including conditions
where metabolomics had not previously been studied at scale. For instance,
Publication II showed that metabolomic biomarkers primarily linked to
cardiometabolic diseases can also predict susceptibility to severe infectious
disease outcomes such as hospitalization or death from COVID-19. Publication III extended these findings by demonstrating widespread associations
of many of these biomarkers across a broad spectrum of common diseases.
In Publication IV, metabolomic risk scores were derived and tested for 12
major chronic diseases, demonstrating consistent predictive performance
across three large biobanks. Collectively, these studies illustrate the potential of comprehensive metabolomic biomarker profiling as a tool for disease
risk prediction.
Future research can build upon these findings to further explore the
role of metabolomics in risk prediction and other applications in precision
medicine. Subsequent studies using the UK Biobank metabolomics dataset
have already resulted in numerous publications, covering topics such as
causal analyses [128,130,236] and other risk prediction studies [151–153]. While
large biobanks are effective for investigating multiple diseases simultaneously, complementary studies in disease-specific clinical cohorts will
help bring greater resolution to these findings. For instance, metabolomic
analyses in clinical cohorts could be used to develop and validate models to
predict the risk of disease progression or complications, extending beyond
the primary prevention focus on first incidence studied here. Additionally,
the UK Biobank participants are not fully representative of the broader
population; as middle-aged volunteers, they tend to be healthier than
average [237,238]. Therefore, studies in more diverse and underrepresented
populations will be essential for further validating these findings. Future work should also evaluate how modern risk prediction tools, such as
those exemplified here, can be integrated into existing clinical workflows
to improve patient stratification and decision-making.
Continuing with disease risk prediction, we also addressed its methodological
aspects. Capturing complex nonlinear relationships, such as interactions among predictor variables, can improve the accuracy of risk
prediction models. In Publication V, we introduced survivalFM, a novel
machine learning method for modelling time-to-event outcomes, such as
disease risks. This method extends the widely used Cox proportional hazards regression by incorporating comprehensive interaction effects among
predictor variables using a factorized parametrization approach, similar
to the one applied in Publication I for predicting drug combination effects.
We showed that accounting for the comprehensive interactions improves
the accuracy of risk prediction models across various disease outcomes
and data sources. A notable advantage of survivalFM, compared to many
advanced machine learning techniques in survival analysis, is its ability
to introduce non-linearity through comprehensive interaction terms while
maintaining model interpretability. It provides estimated effects for both
individual predictors and their interactions, making it particularly valuable in settings where understanding the contributions of predictors is
essential for translational applications.
Looking ahead, several avenues exist for expanding the applications
of survivalFM. Given its generalizability, we anticipate survivalFM to
find use cases in precision medicine and enhance time-to-event modeling in large-scale studies involving many predictors. While the method
demonstrated improved prediction performance in the UK Biobank, we
also showed that capturing predictive interaction effects generally requires large sample sizes. This may constrain its use in smaller cohorts.
However, emerging biobank initiatives with extensive clinical and omics
data provide opportunities for further validation. The generalizable nature of survivalFM allows its application to data sources beyond those
addressed in this work, such as incorporating multi-omics data to leverage
predictive interactions across multiple molecular layers. For instance, proteomics has recently shown promise in risk prediction [109–111], and given a
sufficiently large sample size, survivalFM could be applied to uncover predictive protein-protein interactions. Additionally, from a methodological
perspective, there is potential to extend survivalFM to capture higher-order interactions involving more than two predictors. This could further
improve prediction performance by capturing more complex relationships.
In conclusion, this dissertation has made significant contributions to
precision medicine through the development of predictive modelling approaches and their application to diverse biomedical datasets. By identifying promising drug combinations for cancer treatment, discovering novel
metabolomic biomarkers, and developing methods and models for disease
risk prediction, we have demonstrated how computational approaches
can transform molecular data into actionable insights. Individually and
collectively, these findings advance the translation of molecular data into
prevention and treatment strategies in precision medicine.
References
[1] Francis S Collins and Harold Varmus. A new initiative on precision
medicine. New England Journal of Medicine, 372(9):793–795, 2015.
[2] The White House, Office of the Press Secretary. President Obama’s Precision
Medicine Initiative. https://obamawhitehouse.archives.gov/the-press-office/2015/01/30/fact-sheet-president-obama-s-precision-medicine-initiative,
2015. Accessed: 2024-09-11.
[3] Clare Turnbull, Richard H Scott, Ellen Thomas, Louise Jones, Nirupa
Murugaesu, Freya Boardman Pretty, Dina Halai, Emma Baple, Clare Craig,
Angela Hamblin, et al. The 100 000 Genomes Project: bringing whole
genome sequencing to the NHS. BMJ, 361, 2018.
[4] Our Future Health Protocol. Protocol version 5.0. https://ourfuturehealth.org.uk/our-research-mission/,
2024. Accessed: 2024-08-12.
[5] Yves Lévy. Genomic medicine 2025: France in the race for precision
medicine. The Lancet, 388(10062):2872, 2016.
[6] Eleanor Wong, Nicolas Bertin, Maxime Hebrard, Roberto Tirado-Magallanes, Claire Bellis, Weng Khong Lim, Chee Yong Chua, Philomena
Mei Lin Tong, Raymond Chua, Kenneth Mak, et al. The singapore national
precision medicine strategy. Nature Genetics, 55(2):178–186, 2023.
[7] Shirley Musich, Shaohung Wang, Kevin Hawkins, and Andrea Klemes. The
impact of personalized preventive care on health care quality, utilization,
and expenditures. Population health management, 19(6):389–397, 2016.
[8] Euan A Ashley. Towards precision medicine. Nature Reviews Genetics,
17(9):507–522, 2016.
[9] Holger Fröhlich, Rudi Balling, Niko Beerenwinkel, Oliver Kohlbacher,
Santosh Kumar, Thomas Lengauer, Marloes H Maathuis, Yves Moreau,
Susan A Murphy, Teresa M Przytycka, et al. From hype to reality: data
science enabling personalized medicine. BMC Medicine, 16:1–15, 2018.
[10] Karl Landsteiner. Agglutination phenomena in normal human blood. Wien
Klin Wochenschr, 14:1132–4, 1901.
[11] Mohan Babu and Michael Snyder. Multi-omics profiling for health. Molecular & Cellular Proteomics, 22(6), 2023.
[12] Yehudit Hasin, Marcus Seldin, and Aldons Lusis. Multi-omics approaches
to disease. Genome Biology, 18:1–15, 2017.
[13] Sanjiv Sam Gambhir, T Jessie Ge, Ophir Vermesh, and Ryan Spitler.
Toward achieving precision health. Science Translational Medicine,
10(430):eaao3612, 2018.
[14] Joshua C Denny and Francis S Collins. Precision medicine in 2030—seven
ways to transform healthcare. Cell, 184(6):1415–1419, 2021.
[15] Kevin B Johnson, Wei-Qi Wei, Dilhan Weeraratne, Mark E Frisse, Karl
Misulis, Kyu Rhee, Juan Zhao, and Jane L Snowdon. Precision medicine,
AI, and the future of personalized health care. Clinical and translational
science, 14(1):86–93, 2021.
[16] Andrew L Beam and Isaac S Kohane. Big data and machine learning in
health care. JAMA, 319(13):1317–1318, 2018.
[17] Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine learning in
medicine. New England Journal of Medicine, 380(14):1347–1358, 2019.
[18] Bruna Gomes and Euan A Ashley. Artificial intelligence in molecular
medicine. New England Journal of Medicine, 388(26):2456–2465, 2023.
[19] Sarah J MacEachern and Nils D Forkert. Machine learning for precision
medicine. Genome, 64(4):416–425, 2021.
[20] Julia Hippisley-Cox, Carol Coupland, and Peter Brindle. Development and
validation of QRISK3 risk prediction algorithms to estimate future risk of
cardiovascular disease: prospective cohort study. BMJ, 357, 2017.
[21] Stephen Kaptoge, Lisa Pennells, Dirk De Bacquer, Marie Therese Cooney,
Maryam Kavousi, Gretchen Stevens, Leanne Margaret Riley, Stefan Savin,
Taskeen Khan, Servet Altay, et al. World Health Organization cardiovascular disease risk charts: revised models to estimate risk in 21 global regions.
The Lancet Global Health, 7(10):e1332–e1345, 2019.
[22] SCORE2 working group and ESC Cardiovascular risk collaboration.
SCORE2 risk prediction algorithms: new models to estimate 10-year risk
of cardiovascular disease in Europe. European Heart Journal, 42(25):2439–
2454, 2021.
[23] Amit V Khera, Mark Chaffin, Krishna G Aragam, Mary E Haas, Carolina
Roselli, Seung Hoan Choi, Pradeep Natarajan, Eric S Lander, Steven A
Lubitz, Patrick T Ellinor, et al. Genome-wide polygenic scores for common
diseases identify individuals with risk equivalent to monogenic mutations.
Nature Genetics, 50(9):1219–1224, 2018.
[24] Ali Torkamani, Nathan E Wineinger, and Eric J Topol. The personal and
clinical utility of polygenic risk scores. Nature Reviews Genetics, 19(9):581–
590, 2018.
[25] George Nicholson, Mattias Rantalainen, Anthony D Maher, Jia V Li, Daniel
Malmodin, Kourosh R Ahmadi, Johan H Faber, Ingileif B Hallgrímsdóttir,
Amy Barrett, Henrik Toft, et al. Human metabolic profiles are stably
controlled by genetic and environmental variation. Molecular systems
biology, 7(1):525, 2011.
[26] Robert W McGarrah, Scott B Crown, Guo-Fang Zhang, Svati H Shah,
and Christopher B Newgard. Cardiovascular metabolomics. Circulation
research, 122(9):1238–1258, 2018.
[27] Zsu-Zsu Chen and Robert E Gerszten. Metabolomics and proteomics in
type 2 diabetes. Circulation research, 126(11):1613–1627, 2020.
[28] Peter Würtz, Antti J Kangas, Pasi Soininen, Debbie A Lawlor, George
Davey Smith, and Mika Ala-Korpela. Quantitative serum nuclear magnetic
resonance metabolomics in large-scale epidemiology: a primer on -omic
technologies. American Journal of Epidemiology, 186(9):1084–1096, 2017.
[29] Maik Pietzner, Isobel D Stewart, Johannes Raffler, Kay-Tee Khaw, Gregory A Michelotti, Gabi Kastenmüller, Nicholas J Wareham, and Claudia
Langenberg. Plasma metabolites to profile pathways in noncommunicable
disease multimorbidity. Nature Medicine, 27(3):471–479, 2021.
[30] Joris Deelen, Johannes Kettunen, Krista Fischer, Ashley van der Spek,
Stella Trompet, Gabi Kastenmüller, Andy Boyd, Jonas Zierer, Erik B
van den Akker, Mika Ala-Korpela, et al. A metabolic profile of all-cause
mortality risk identified in an observational study of 44,168 individuals.
Nature Communications, 10(1):3346, 2019.
[31] Pasi Soininen, Antti J Kangas, Peter Würtz, Teemu Suna, and Mika Ala-Korpela. Quantitative serum nuclear magnetic resonance metabolomics
in cardiovascular epidemiology and genetics. Circulation: Cardiovascular
Genetics, 8(1):192–206, 2015.
[32] David R Cox. Regression models and life-tables. Journal of the Royal
Statistical Society: Series B (Methodological), 34(2):187–202, 1972.
[33] Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, and Michael S
Lauer. Random survival forests. The Annals of Applied Statistics, 2(3):841–
860, 2008.
[34] Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates,
Tingting Jiang, and Yuval Kluger. DeepSurv: personalized treatment
recommender system using a cox proportional hazards deep neural network.
BMC Medical Research Methodology, 18:1–12, 2018.
[35] Chirag Nagpal, Xinyu Li, and Artur Dubrawski. Deep survival machines:
Fully parametric survival regression and representation learning for censored data with competing risks. IEEE Journal of Biomedical and Health
Informatics, 25(8):3163–3175, 2021.
[36] David J Hunter and Christopher Holmes. Where medical statistics meets
artificial intelligence. New England Journal of Medicine, 389(13):1211–
1219, 2023.
[37] Rebecca Giddings, Anabel Joseph, Thomas Callender, Sam M Janes, Mihaela Van der Schaar, Jessica Sheringham, and Neal Navani. Factors
influencing clinician and patient interaction with machine learning-based
risk prediction models: a systematic review. The Lancet Digital Health,
6(2):e131–e144, 2024.
[38] Cynthia Rudin. Stop explaining black box machine learning models for high
stakes decisions and use interpretable models instead. Nature Machine
Intelligence, 1(5):206–215, 2019.
[39] Gregor Stiglic, Primoz Kocbek, Nino Fijacko, Marinka Zitnik, Katrien Verbert, and Leona Cilar. Interpretability of machine learning-based prediction
models in healthcare. Wiley Interdisciplinary Reviews: Data Mining and
Knowledge Discovery, 10(5):e1379, 2020.
[40] Chris Finan, Anna Gaulton, Felix A Kruger, R Thomas Lumbers, Tina
Shah, Jorgen Engmann, Luana Galver, Ryan Kelley, Anneli Karlsson, Rita
Santos, et al. The druggable genome and support for target identification and validation in drug development. Science Translational Medicine,
9(383):eaag1166, 2017.
[41] Jussi Paananen and Vittorio Fortino. An omics perspective on drug target
discovery platforms. Briefings in Bioinformatics, 21(6):1937–1953, 2020.
[42] Philippe L Bedard, David M Hyman, Matthew S Davids, and Lillian L Siu.
Small molecules, big impact: 20 years of targeted therapy in oncology. The
Lancet, 395(10229):1078–1088, 2020.
[43] Bert Vogelstein, Nickolas Papadopoulos, Victor E Velculescu, Shibin Zhou,
Luis A Diaz Jr, and Kenneth W Kinzler. Cancer genome landscapes. Science,
339(6127):1546–1558, 2013.
[44] David M Hyman, Barry S Taylor, and José Baselga. Implementing genome-driven oncology. Cell, 168(4):584–599, 2017.
[45] Charles Sawyers. Targeted cancer therapy. Nature, 432(7015):294–297,
2004.
[46] Caitriona Holohan, Sandra Van Schaeybroeck, Daniel B Longley, and
Patrick G Johnston. Cancer drug resistance: an evolving paradigm. Nature
Reviews Cancer, 13(10):714–726, 2013.
[47] Haojie Jin, Liqin Wang, and René Bernards. Rational combinations of
targeted cancer therapies: background, advances and challenges. Nature
Reviews Drug Discovery, 22(3):213–234, 2023.
[48] Deborah Plana, Adam C Palmer, and Peter K Sorger. Independent drug
action in combination therapy: implications for precision oncology. Cancer
Discovery, 12(3):606–624, 2022.
[49] Bissan Al-Lazikani, Udai Banerji, and Paul Workman. Combinatorial
drug therapy for cancer in the post-genomic era. Nature Biotechnology,
30(7):679–692, 2012.
[50] Pradipta Das, Michael D Delost, Munaum H Qureshi, David T Smith,
and Jon T Njardarson. A survey of the structures of US FDA approved
combination drugs. Journal of Medicinal Chemistry, 62(9):4265–4311, 2018.
[51] Jia Jia, Feng Zhu, Xiaohua Ma, Zhiwei W Cao, Yixue X Li, and Yu Zong
Chen. Mechanisms of drug combinations: interaction and network perspectives. Nature Reviews Drug Discovery, 8(2):111–128, 2009.
[52] Salvador Fudio, Alvaro Sellers, Laura Pérez Ramos, Beatriz Gil-Alberdi,
Ali Zeaiter, Mikel Urroz, Antonio Carcas, and Rubin Lubomirov. Anticancer drug combinations approved by US FDA from 2011 to 2021: main
design features of clinical trials and role of pharmacokinetics. Cancer
Chemotherapy and Pharmacology, 90(4):285–299, 2022.
[53] Susan L Holbeck, Richard Camalier, James A Crowell, Jeevan Prasaad
Govindharajulu, Melinda Hollingshead, Lawrence W Anderson, Eric Polley,
Larry Rubinstein, Apurva Srivastava, Deborah Wilsker, et al. The National Cancer Institute ALMANAC: a comprehensive screening resource for
the detection of anticancer drug pairs with enhanced therapeutic activity.
Cancer Research, 77(13):3564–3576, 2017.
[54] Patricia Jaaks, Elizabeth A Coker, Daniel J Vis, Olivia Edwards, Emma F
Carpenter, Simonetta M Leto, Lisa Dwane, Francesco Sassi, Howard Lightfoot, Syd Barthorpe, et al. Effective drug combinations in breast, colon and
pancreatic cancer cells. Nature, 603(7899):166–173, 2022.
[55] Nishanth Ulhas Nair, Patricia Greninger, Xiaohu Zhang, Adam A Friedman,
Arnaud Amzallag, Eliane Cortez, Avinash Das Sahu, Joo Sang Lee, Anahita
Dastur, Regina K Egan, et al. A landscape of response to drug combinations
in non-small cell lung cancer. Nature Communications, 14(1):3830, 2023.
[56] Zohar B Weinstein, Andreas Bender, and Murat Cokol. Prediction of synergistic drug combinations. Current Opinion in Systems Biology, 4:24–28,
2017.
[57] Lianlian Wu, Yuqi Wen, Dongjin Leng, Qinglong Zhang, Chong Dai, Zhongming Wang, Ziqi Liu, Bowei Yan, Yixin Zhang, Jing Wang, et al. Machine
learning methods, databases and tools for drug combination prediction.
Briefings in Bioinformatics, 23(1):bbab355, 2022.
[58] Weikaixin Kong, Gianmarco Midena, Yingjia Chen, Paschalis Athanasiadis,
Tianduanyi Wang, Juho Rousu, Liye He, and Tero Aittokallio. Systematic
review of computational methods for drug combination prediction. Computational and Structural Biotechnology Journal, 20:2807–2814, 2022.
[59] Anna Torkamannia, Yadollah Omidi, and Reza Ferdousi. A review of machine learning approaches for drug synergy prediction in cancer. Briefings
in Bioinformatics, 23(3):bbac075, 2022.
[60] Fei Zhou, Ting Yu, Ronghui Du, Guohui Fan, Ying Liu, Zhibo Liu, Jie
Xiang, Yeming Wang, Bin Song, Xiaoying Gu, et al. Clinical course and risk
factors for mortality of adult inpatients with COVID-19 in Wuhan, China:
a retrospective cohort study. The Lancet, 395(10229):1054–1062, 2020.
[61] Matthew J Cummings, Matthew R Baldwin, Darryl Abrams, Samuel D
Jacobson, Benjamin J Meyer, Elizabeth M Balough, Justin G Aaron, Jan
Claassen, LeRoy E Rabbani, Jonathan Hastie, et al. Epidemiology, clinical
course, and outcomes of critically ill adults with COVID-19 in New York
City: a prospective cohort study. The Lancet, 395(10239):1763–1770, 2020.
[62] Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata.
Higher-order factorization machines. Advances in Neural Information
Processing Systems, 29, 2016.
[63] Teri A Manolio. Genomewide association studies and assessment of the
risk of disease. New England Journal of Medicine, 363(2):166–176, 2010.
[64] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of GWAS discovery:
biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017.
[65] Matthew R Nelson, Hannah Tipney, Jeffery L Painter, Judong Shen, Paola
Nicoletti, Yufeng Shen, Aris Floratos, Pak Chung Sham, Mulin Jun Li,
Junwen Wang, et al. The support of human genetic evidence for approved
drug indications. Nature Genetics, 47(8):856–860, 2015.
[66] Emily A King, J Wade Davis, and Jacob F Degner. Are drug targets with
genetic support twice as likely to be approved? revised estimates of the
impact of genetic support for drug mechanisms on the probability of drug
approval. PLoS Genetics, 15(12):e1008489, 2019.
[67] Emil Uffelmann, Qin Qin Huang, Nchangwi Syntia Munung, Jantina
De Vries, Yukinori Okada, Alicia R Martin, Hilary C Martin, Tuuli Lappalainen, and Danielle Posthuma. Genome-wide association studies. Nature
Reviews Methods Primers, 1(1):59, 2021.
[68] Eleftheria Zeggini, Anna L Gloyn, Anne C Barton, and Louise V Wain.
Translational genomics and precision medicine: Moving from the lab to the
clinic. Science, 365(6460):1409–1413, 2019.
[69] Cathryn M Lewis and Evangelos Vassos. Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 12(1):44, 2020.
[70] Tjeerd Van Der Ploeg, Peter C Austin, and Ewout W Steyerberg. Modern
modelling techniques are data hungry: a simulation study for predicting
dichotomous endpoints. BMC Medical Research Methodology, 14:1–13,
2014.
[71] Evangelia Christodoulou, Jie Ma, Gary S Collins, Ewout W Steyerberg,
Jan Y Verbakel, and Ben Van Calster. A systematic review shows no
performance benefit of machine learning over logistic regression for clinical
prediction models. Journal of Clinical Epidemiology, 110:12–22, 2019.
[72] Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A
Rosati. Evaluating the yield of medical tests. JAMA, 247(18):2543–2546,
1982.
[73] Michael J Pencina, Ralph B D’Agostino Sr, Ralph B D’Agostino Jr, and
Ramachandran S Vasan. Evaluating the added predictive ability of a new
marker: from area under the ROC curve to reclassification and beyond.
Statistics in Medicine, 27(2):157–172, 2008.
[74] Gary S Collins, Johannes B Reitsma, Douglas G Altman, and Karel GM
Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Circulation,
131(2):211–219, 2015.
[75] Thomas R Dawber, William B Kannel, Nicholas Revotskie, Joseph
Stokes III, Abraham Kagan, and Tavia Gordon. Some factors associated
with the development of coronary heart disease—six years’ follow-up experience in the Framingham study. American Journal of Public Health and
the Nations Health, 49(10):1349–1356, 1959.
[76] William B Kannel, Thomas R Dawber, Abraham Kagan, Nicholas Revotskie, and Joseph Stokes III. Factors of risk in the development of
coronary heart disease—six-year follow-up experience: the Framingham
study. Annals of Internal Medicine, 55(1):33–50, 1961.
[77] Jeanne Truett, Jerome Cornfield, and William Kannel. A multivariate
analysis of the risk of coronary heart disease in Framingham. Journal of
Chronic Diseases, 20(7):511–524, 1967.
[78] Ancel Keys. Coronary heart disease in seven countries. Circulation, 41:I–
211, 1970.
[79] Charlene F Belanger, Charles H Hennekens, Bernard Rosner, Frank E
Speizer, et al. The nurses’ health study. Am J Nurs, 78(6):1039–1040, 1978.
[80] Peter WF Wilson, Ralph B D’Agostino, Daniel Levy, Albert M Belanger,
Halit Silbershatz, and William B Kannel. Prediction of coronary heart
disease using risk factor categories. Circulation, 97(18):1837–1847, 1998.
[81] Syed S Mahmood, Daniel Levy, Ramachandran S Vasan, and Thomas J
Wang. The Framingham heart study and the epidemiology of cardiovascular
disease: a historical perspective. The Lancet, 383(9921):999–1008, 2014.
[82] Mitchell H Gail, Louise A Brinton, David P Byar, Donald K Corle, Sylvan B
Green, Catherine Schairer, and John J Mulvihill. Projecting individualized probabilities of developing breast cancer for white females who are
being examined annually. JNCI: Journal of the National Cancer Institute,
81(24):1879–1886, 1989.
[83] Julia Hippisley-Cox and Carol Coupland. Development and validation of
QDiabetes-2018 risk prediction algorithm to estimate future risk of type 2
diabetes: cohort study. BMJ, 359, 2017.
[84] Julia Hippisley-Cox and Carol Coupland. Development and validation of
risk prediction algorithms to estimate future risk of common cancers in
men and women: prospective cohort study. BMJ Open, 5(3):e007825, 2015.
[85] Erkki Vartiainen, Tiina Laatikainen, Pekka Jousilahti, Markku Peltonen,
Teemu Niiranen, and Veikko Salomaa. Sepelvaltimotaudin ja aivohalvauksen riskin arviointi FINRISKI 2.0-laskurilla. 2020.
[86] David C Goff Jr, Donald M Lloyd-Jones, Glen Bennett, Sean Coady, Ralph B
D’Agostino, Raymond Gibbons, Philip Greenland, Daniel T Lackland, Daniel
Levy, Christopher J O’Donnell, et al. 2013 ACC/AHA guideline on the
assessment of cardiovascular risk: a report of the American College of
Cardiology/American Heart Association Task Force on Practice Guidelines.
Circulation, 129(25_suppl_2):S49–S73, 2014.
[87] Xueli Yang, Jianxin Li, Dongsheng Hu, Jichun Chen, Ying Li, Jianfeng
Huang, Xiaoqing Liu, Fangchao Liu, Jie Cao, Chong Shen, et al. Predicting the 10-year risks of atherosclerotic cardiovascular disease in Chinese
population: the China-PAR project (Prediction for ASCVD Risk in China).
Circulation, 134(19):1430–1440, 2016.
[88] National Institute for Health and Care Excellence. Cardiovascular disease: risk assessment and reduction, including lipid modification (NICE guideline [NG238]), 2023.
https://www.nice.org.uk/guidance/ng238/chapter/Recommendations#statins-for-primary-prevention-of-cardiovascular-disease. Date accessed: 2024-04-30.
[89] Frank LJ Visseren, François Mach, Yvo M Smulders, David Carballo, Konstantinos C Koskinas, Maria Bäck, Athanase Benetos, Alessandro Biffi,
José-Manuel Boavida, Davide Capodanno, et al. 2021 ESC Guidelines
on cardiovascular disease prevention in clinical practice: Developed by
the Task Force for cardiovascular disease prevention in clinical practice
with representatives of the European Society of Cardiology and 12 medical societies With the special contribution of the European Association of
Preventive Cardiology (EAPC). European Heart Journal, 42(34):3227–3337,
2021.
[90] GBD 2015 Risk Factors Collaborators et al. Global, regional, and national
comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks, 1990–2015: a systematic
analysis for the Global Burden of Disease Study 2015. Lancet (London,
England), 388(10053):1659, 2016.
[91] World Health Organization. Global spending on health 2020: weathering
the storm. 2020.
[92] John Robson, Isabel Dostal, Aziz Sheikh, Sandra Eldridge, Vichithranie
Madurasinghe, Chris Griffiths, Carol Coupland, and Julia Hippisley-Cox.
The NHS Health Check in England: an evaluation of the first 4 years. BMJ
Open, 6(1):e008840, 2016.
[93] NHS Health Check. https://www.nhs.uk/conditions/nhs-health-check/.
Accessed September 16th 2024.
[94] Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T
Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau,
Jared O’Connell, et al. The UK Biobank resource with deep phenotyping
and genomic data. Nature, 562(7726):203–209, 2018.
[95] Mitja I Kurki, Juha Karjalainen, Priit Palta, Timo P Sipilä, Kati Kristiansson, Kati M Donner, Mary P Reeve, Hannele Laivuori, Mervi Aavikko,
Mari A Kaunisto, et al. FinnGen provides genetic insights from a well-phenotyped isolated population. Nature, 613(7944):508–518, 2023.
[96] The All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program.
Nature, 627(8003):340–346, 2024.
[97] Iain S Forrest, Ben O Petrazzini, Áine Duffy, Joshua K Park, Carla Marquez-Luna, Daniel M Jordan, Ghislain Rocheleau, Judy H Cho, Robert S Rosenson, Jagat Narula, et al. Machine learning-based marker for coronary
artery disease: derivation and validation in two longitudinal cohorts. The
Lancet, 401(10372):215–225, 2023.
[98] Ben O Petrazzini, Kumardeep Chaudhary, Carla Márquez-Luna, Iain S
Forrest, Ghislain Rocheleau, Judy Cho, Jagat Narula, Girish Nadkarni, and
Ron Do. Coronary risk estimation based on clinical data in electronic health
records. Journal of the American College of Cardiology, 79(12):1155–1166,
2022.
[99] Ruowang Li, Yong Chen, Marylyn D Ritchie, and Jason H Moore. Electronic
health records and polygenic risk scores for predicting disease risk. Nature
Reviews Genetics, 21(8):493–502, 2020.
[100] Michael Inouye, Gad Abraham, Christopher P Nelson, Angela M Wood,
Michael J Sweeting, Frank Dudbridge, Florence Y Lai, Stephen Kaptoge,
Marta Brozynska, Tingting Wang, et al. Genomic risk prediction of coronary
artery disease in 480,000 adults: implications for primary prevention.
Journal of the American College of Cardiology, 72(16):1883–1893, 2018.
[101] Gad Abraham, Rainer Malik, Ekaterina Yonova-Doing, Agus Salim, Tingting Wang, John Danesh, Adam S Butterworth, Joanna MM Howson,
Michael Inouye, and Martin Dichgans. Genomic risk score offers predictive
performance comparable to clinical risk factors for ischaemic stroke. Nature
Communications, 10(1):5819, 2019.
[102] Atlas Khan, Michael C Turchin, Amit Patki, Vinodh Srinivasasainagendra,
Ning Shang, Rajiv Nadukuru, Alana C Jones, Edyta Malolepsza, Ozan
Dikilitas, Iftikhar J Kullo, et al. Genome-wide polygenic score to predict
chronic kidney disease across ancestries. Nature Medicine, 28(7):1412–1420,
2022.
[103] Tian Ge, Marguerite R Irvin, Amit Patki, Vinodh Srinivasasainagendra,
Yen-Feng Lin, Hemant K Tiwari, Nicole D Armstrong, Barbara Benoit, Chia-Yen Chen, Karmel W Choi, et al. Development and validation of a trans-ancestry polygenic risk score for type 2 diabetes in diverse populations.
Genome Medicine, 14(1):70, 2022.
[104] Max Tamlander, Bradley Jermy, Toni T Seppälä, Martti Färkkilä, FinnGen,
Elisabeth Widén, Samuli Ripatti, and Nina Mars. Genome-wide polygenic
risk scores for colorectal cancer have implications for risk-based screening.
British Journal of Cancer, 130(4):651–659, 2024.
[105] Xin Yang, Siddhartha Kar, Antonis C Antoniou, and Paul DP Pharoah.
Polygenic scores in cancer. Nature Reviews Cancer, 23(9):619–630, 2023.
[106] Rayjean J Hung, Matthew T Warkentin, Yonathan Brhane, Nilanjan Chatterjee, David C Christiani, Maria Teresa Landi, Neil E Caporaso, Geoffrey
Liu, Mattias Johansson, Demetrius Albanes, et al. Assessing lung cancer
absolute risk trajectory based on a polygenic risk model. Cancer Research,
81(6):1607–1615, 2021.
[107] Nasim Mavaddat, Kyriaki Michailidou, Joe Dennis, Michael Lush, Laura
Fachal, Andrew Lee, Jonathan P Tyrer, Ting-Huei Chen, Qin Wang, Manjeet K Bolla, et al. Polygenic risk scores for prediction of breast cancer
and breast cancer subtypes. The American Journal of Human Genetics,
104(1):21–34, 2019.
[108] Genevieve L Wojcik, Mariaelisa Graff, Katherine K Nishimura, Ran Tao,
Jeffrey Haessler, Christopher R Gignoux, Heather M Highland, Yesha M
Patel, Elena P Sorokin, Christy L Avery, et al. Genetic analyses of diverse
populations improves discovery for complex traits. Nature, 570(7762):514–
518, 2019.
[109] Julia Carrasco-Zanini, Maik Pietzner, Jonathan Davitte, Praveen Surendran, Damien C Croteau-Chonka, Chloe Robins, Ana Torralbo, Christopher
Tomlinson, Florian Grünschläger, Natalie Fitzpatrick, et al. Proteomic
signatures improve risk prediction for common and rare diseases. Nature
Medicine, pages 1–10, 2024.
[110] Jia You, Yu Guo, Yi Zhang, Ju-Jiao Kang, Lin-Bo Wang, Jian-Feng Feng,
Wei Cheng, and Jin-Tai Yu. Plasma proteomic profiles predict individual
future health risk. Nature Communications, 14(1):7817, 2023.
[111] Danni A Gadd, Robert F Hillary, Zhana Kuncheva, Tasos Mangelis, Yipeng
Cheng, Manju Dissanayake, Romi Admanit, Jake Gagnon, Tinchi Lin,
Kyle L Ferber, et al. Blood protein assessment of leading incident diseases
and mortality in the UK Biobank. Nature Aging, pages 1–10, 2024.
[112] Oliver Fiehn. Metabolomics—the link between genotypes and phenotypes.
Functional Genomics, pages 155–171, 2002.
[113] Elaine Holmes, Ruey Leng Loo, Jeremiah Stamler, Magda Bictash, Ivan KS
Yap, Queenie Chan, Tim Ebbels, Maria De Iorio, Ian J Brown, Kirill A
Veselkov, et al. Human metabolic phenotype diversity and its association
with diet and blood pressure. Nature, 453(7193):396–400, 2008.
[114] Aifric O’Sullivan, Michael J Gibney, and Lorraine Brennan. Dietary intake
patterns are reflected in metabolomic profiles: potential role in dietary
assessment studies. The American Journal of Clinical Nutrition, 93(2):314–
321, 2011.
[115] Rima Kaddurah-Daouk, Bruce S Kristal, and Richard M Weinshilboum.
Metabolomics: a global biochemical approach to drug response and disease.
Annu. Rev. Pharmacol. Toxicol., 48(1):653–683, 2008.
[116] David S Wishart. Metabolomics for investigating physiological and pathophysiological processes. Physiological Reviews, 99(4):1819–1875, 2019.
[117] David S Wishart, AnChi Guo, Eponine Oler, Fei Wang, Afia Anjum, Harrison
Peters, Raynard Dizon, Zinat Sayeeda, Siyang Tian, Brian L Lee, et al.
HMDB 5.0: the human metabolome database for 2022. Nucleic Acids
Research, 50(D1):D622–D631, 2022.
[118] Aihua Zhang, Hui Sun, Ping Wang, Ying Han, and Xijun Wang. Recent and
potential developments of biofluid analyses in metabolomics. Journal of
Proteomics, 75(4):1079–1088, 2012.
[119] Abdul-Hamid M Emwas. The strengths and weaknesses of NMR spectroscopy and mass spectrometry with particular focus on metabolomics
research. Metabonomics: Methods and Protocols, pages 161–193, 2015.
[120] John L Markley, Rafael Brüschweiler, Arthur S Edison, Hamid R Eghbalnia,
Robert Powers, Daniel Raftery, and David S Wishart. The future of NMR-based metabolomics. Current Opinion in Biotechnology, 43:34–40, 2017.
[121] Jeramie D Watrous, Mir Henglin, Brian Claggett, Kim A Lehmann, Martin G Larson, Susan Cheng, and Mohit Jain. Visualization, quantification,
and alignment of spectral drift in population scale untargeted metabolomics
data. Analytical Chemistry, 89(3):1399–1404, 2017.
[122] GA Nagana Gowda and Danijel Djukovic. Overview of mass spectrometry-based metabolomics: opportunities and challenges. Mass Spectrometry in
Metabolomics: Methods and Protocols, pages 3–12, 2014.
[123] Amanda Rundblad, Jacob J Christensen, Kristin S Hustad, Nasser E Bastani, Inger Ottestad, Kirsten B Holven, and Stine M Ulven. Associations
between dietary intake and glucose tolerance in clinical and metabolomics-based metabotypes. Genes & Nutrition, 18(1):3, 2023.
[124] Yu Xu, Scott C Ritchie, Yujian Liang, Paul RHJ Timmers, Maik Pietzner,
Loïc Lannelongue, Samuel A Lambert, Usman A Tahir, Sebastian May-Wilson, Carles Foguet, et al. An atlas of genetic scores to predict multi-omic
traits. Nature, 616(7955):123–131, 2023.
[125] Minna K Karjalainen, Savita Karthikeyan, Clare Oliver-Williams, Eeva
Sliz, Elias Allara, Wing Tung Fung, Praveen Surendran, Weihua Zhang,
Pekka Jousilahti, Kati Kristiansson, et al. Genome-wide characterization
of circulating metabolic biomarkers. Nature, 628(8006):130–138, 2024.
[126] Luca A Lotta, Maik Pietzner, Isobel D Stewart, Laura BL Wittemans,
Chen Li, Roberto Bonelli, Johannes Raffler, Emma K Biggs, Clare Oliver-Williams, Victoria PW Auyeung, et al. Cross-platform genetic discovery of
small molecule products of metabolism and application to clinical outcomes.
Nature Genetics, 53(1):54, 2021.
[127] Johannes Kettunen, Ayşe Demirkan, Peter Würtz, Harmen HM Draisma,
Toomas Haller, Rajesh Rawal, Anika Vaarhorst, Antti J Kangas, Leo-Pekka
Lyytikäinen, Matti Pirinen, et al. Genome-wide study for circulating
metabolites identifies 62 loci and reveals novel systemic effects of LPA.
Nature Communications, 7(1):11122, 2016.
[128] Joshua A Bell, Tom G Richardson, Qin Wang, Eleanor Sanderson, Tom
Palmer, Venexia Walker, Linda M O’Keeffe, Nicholas J Timpson, Anna
Cichonska, Heli Julkunen, et al. Effects of general and central adiposity
on circulating lipoprotein, lipid, and metabolite levels in UK Biobank: a
multivariable Mendelian randomization study. The Lancet Regional Health–
Europe, 21, 2022.
[129] Jiarui Mi, Lingjuan Jiang, Zhengye Liu, Xia Wu, Nan Zhao, Yuanzhuo
Wang, and Xiaoyin Bai. Identification of blood metabolites linked to the
risk of cholelithiasis: a comprehensive Mendelian randomization study.
Hepatology International, 16(6):1484–1493, 2022.
[130] Maria Carolina Borges, Philip C Haycock, Jie Zheng, Gibran Hemani,
Michael V Holmes, George Davey Smith, Aroon D Hingorani, and Deborah A
Lawlor. Role of circulating polyunsaturated fatty acids on cardiovascular
diseases risk: analysis using Mendelian randomization and fatty acid
genetic association data from over 114,000 UK Biobank participants. BMC
Medicine, 20(1):210, 2022.
[131] Thomas J Wang, Martin G Larson, Ramachandran S Vasan, Susan Cheng,
Eugene P Rhee, Elizabeth McCabe, Gregory D Lewis, Caroline S Fox,
Paul F Jacques, Céline Fernandez, et al. Metabolite profiles and the risk of
developing diabetes. Nature Medicine, 17(4):448–453, 2011.
[132] Anna Floegel, Norbert Stefan, Zhonghao Yu, Kristin Mühlenbruch, Dagmar
Drogan, Hans-Georg Joost, Andreas Fritsche, Hans-Ulrich Häring, Martin
Hrabě de Angelis, Annette Peters, et al. Identification of serum metabolites associated with risk of type 2 diabetes using a targeted metabolomic
approach. Diabetes, 62(2):639–648, 2013.
[133] Peter Würtz, Pasi Soininen, Antti J Kangas, Tapani Rönnemaa, Terho
Lehtimäki, Mika Kähönen, Jorma S Viikari, Olli T Raitakari, and Mika
Ala-Korpela. Branched-chain and aromatic amino acids are predictors of
insulin resistance in young adults. Diabetes Care, 36(3):648–655, 2013.
[134] Svati H Shah, William E Kraus, and Christopher B Newgard. Metabolomic
profiling for the identification of novel biomarkers and mechanisms related to common cardiovascular diseases: form and function. Circulation,
126(9):1110–1120, 2012.
[135] Christin Stegemann, Raimund Pechlaner, Peter Willeit, Sarah R Langley,
Massimo Mangino, Ursula Mayr, Cristina Menni, Alireza Moayyeri, Peter
Santer, Gregor Rungger, et al. Lipidomics profiling and risk of cardiovascular disease in the prospective population-based Bruneck study. Circulation,
129(18):1821–1831, 2014.
[136] Peter Würtz, Juho R Raiko, Costan G Magnussen, Pasi Soininen,
Antti J Kangas, Tuulia Tynkkynen, Russell Thomson, Reino Laatikainen,
Markku J Savolainen, Jari Laurikka, et al. High-throughput quantification
of circulating metabolites improves prediction of subclinical atherosclerosis.
European Heart Journal, 33(18):2307–2316, 2012.
[137] Peter Würtz, Aki S Havulinna, Pasi Soininen, Tuulia Tynkkynen, David
Prieto-Merino, Therese Tillin, Anahita Ghorbani, Anna Artati, Qin Wang,
Mika Tiainen, et al. Metabolite profiling and cardiovascular event risk: a
prospective study of 3 population-based cohorts. Circulation, 131(9):774–
785, 2015.
[138] Ari V Ahola-Olli, Linda Mustelin, Maria Kalimeri, Johannes Kettunen,
Jari Jokelainen, Juha Auvinen, Katri Puukka, Aki S Havulinna, Terho
Lehtimäki, Mika Kähönen, et al. Circulating metabolites and the risk
of type 2 diabetes: a prospective study of 11,896 young adults from four
Finnish cohorts. Diabetologia, 62:2298–2309, 2019.
[139] MC Borges, AF Schmidt, B Jefferis, SG Wannamethee, DA Lawlor, M Kivimaki, et al. Circulating fatty acids and risk of coronary heart disease
and stroke: individual participant data meta-analysis in up to 16 126
participants. Journal of the American Heart Association, 2020.
[140] Emmi Tikkanen, Vilma Jägerroos, Michael V Holmes, Naveed Sattar,
Mika Ala-Korpela, Pekka Jousilahti, Annamari Lundqvist, Markus Perola, Veikko Salomaa, and Peter Würtz. Metabolic biomarker discovery for
risk of peripheral artery disease compared with coronary artery disease:
lipoprotein and metabolite profiling of 31 657 individuals from 5 prospective
cohorts. Journal of the American Heart Association, 10(23):e021995, 2021.
[141] Michael V Holmes, Iona Y Millwood, Christiana Kartsonaki, Michael R Hill,
Derrick A Bennett, Ruth Boxall, Yu Guo, Xin Xu, Zheng Bian, Ruying Hu,
et al. Lipids, lipoproteins, and metabolites and risk of myocardial infarction
and stroke. Journal of the American College of Cardiology, 71(6):620–632,
2018.
[142] Krista Fischer, Johannes Kettunen, Peter Würtz, Toomas Haller, Aki S
Havulinna, Antti J Kangas, Pasi Soininen, Tonu Esko, Mari-Liis Tammesoo,
Reedik Mägi, et al. Biomarker profiling by nuclear magnetic resonance
spectroscopy for the prediction of all-cause mortality: an observational
study of 17,345 persons. PLoS Medicine, 11(2):e1001606, 2014.
[143] Scott C Ritchie, Peter Würtz, Artika P Nath, Gad Abraham, Aki S
Havulinna, Liam G Fearnley, Antti-Pekka Sarin, Antti J Kangas, Pasi
Soininen, Kristiina Aalto, et al. The biomarker GlycA is associated with
chronic inflammation and predicts long-term risk of severe infection. Cell
Systems, 1(4):293–301, 2015.
[144] Johannes Kettunen, Scott C Ritchie, Olga Anufrieva, Leo-Pekka Lyytikäinen, Jussi Hernesniemi, Pekka J Karhunen, Pekka Kuukasjärvi, Jari
Laurikka, Mika Kähönen, Terho Lehtimäki, et al. Biomarker glycoprotein
acetyls is associated with the risk of a wide spectrum of incident diseases
and stratifies mortality risk in angiography patients. Circulation: Genomic
and Precision Medicine, 11(11):e002234, 2018.
[145] Juho Tynkkynen, Vincent Chouraki, Sven J van der Lee, Jussi Hernesniemi,
Qiong Yang, Shuo Li, Alexa Beiser, Martin G Larson, Katri Sääksjärvi,
Martin J Shipley, et al. Association of branched-chain amino acids and
other circulating metabolites with risk of incident dementia and Alzheimer’s
disease: a prospective study in eight cohorts. Alzheimer’s & Dementia,
14(6):723–733, 2018.
[146] Sven J van der Lee, Charlotte E Teunissen, René Pool, Martin J Shipley,
Alexander Teumer, Vincent Chouraki, Debora Melo van Lent, Juho Tynkkynen, Krista Fischer, Jussi Hernesniemi, et al. Circulating metabolites and
general cognitive ability and dementia: Evidence from 11 cohort studies.
Alzheimer’s & Dementia, 14(6):707–722, 2018.
[147] Lucie Lécuyer, Agnès Victor Bala, Mélanie Deschasaux, Nadia Bouchemal,
Mohamed Nawfal Triba, Marie-Paule Vasson, Adrien Rossary, Aicha Demidem, Pilar Galan, Serge Hercberg, et al. NMR metabolomic signatures
reveal predictive plasma metabolites associated with long-term risk of developing breast cancer. International Journal of Epidemiology, 47(2):484–494,
2018.
[148] Päivi Sirniö, Juha P Väyrynen, Kai Klintrup, Jyrki Mäkelä, Markus J
Mäkinen, Tuomo J Karttunen, and Anne Tuomisto. Decreased serum
apolipoprotein A1 levels are associated with poor survival and systemic
inflammatory response in colorectal cancer. Scientific Reports, 7(1):5374,
2017.
[149] Jesse Fest, Lisanne S Vijfhuizen, Jelle J Goeman, Olga Veth, Anni Joensuu, Markus Perola, Satu Männistö, Eivind Ness-Jensen, Kristian Hveem,
Toomas Haller, et al. Search for early pancreatic cancer blood biomarkers in five European prospective population biobanks using metabolomics.
Endocrinology, 160(7):1731–1742, 2019.
[150] Jared R Mayers, Chen Wu, Clary B Clish, Peter Kraft, Margaret E Torrence,
Brian P Fiske, Chen Yuan, Ying Bao, Mary K Townsend, Shelley S Tworoger,
et al. Elevation of circulating branched-chain amino acids is an early
event in human pancreatic adenocarcinoma development. Nature Medicine,
20(10):1193–1198, 2014.
[151] Thore Buergel, Jakob Steinfeldt, Greg Ruyoga, Maik Pietzner, Daniele
Bizzarri, Dina Vojinovic, Julius Upmeier zu Belzen, Lukas Loock, Paul
Kittner, Lara Christmann, et al. Metabolomic profiles predict individual
multidisease outcomes. Nature Medicine, 28(11):2309–2320, 2022.
[152] Fiona Bragg, Eirini Trichia, Diego Aguilar-Ramirez, Jelena Bešević, Sarah
Lewington, and Jonathan Emberson. Predictive value of circulating NMR
metabolic biomarkers for type 2 diabetes risk in the UK Biobank study.
BMC Medicine, 20(1):159, 2022.
[153] Xinyu Zhang, Wenyi Hu, Yueye Wang, Wei Wang, Huan Liao, Xiayin Zhang,
Katerina V Kiburg, Xianwen Shang, Gabriella Bulloch, Yu Huang, et al.
Plasma metabolomic profiles of dementia: a prospective study of 110,655
participants in the UK Biobank. BMC Medicine, 20(1):252, 2022.
[154] Yi-Xuan Qiang, Jia You, Xiao-Yu He, Yu Guo, Yue-Ting Deng, Pei-Yang Gao,
Xin-Rui Wu, Jian-Feng Feng, Wei Cheng, and Jin-Tai Yu. Plasma metabolic
profiles predict future dementia and dementia subtypes: a prospective
analysis of 274,160 participants. Alzheimer’s Research & Therapy, 16(1):16,
2024.
[155] Rafael R Oexner, Hyunchan Ahn, Konstantinos Theofilatos, Ravi A Shah,
Robin Schmitt, Philip Chowienczyk, Anna Zoccarato, and Ajay M Shah.
Serum metabolomics improves risk stratification for incident heart failure.
European Journal of Heart Failure, 26(4):829–840, 2024.
[156] Zhening Liu, Hangkai Huang, Jiarong Xie, Yingying Xu, and Chengfu
Xu. Circulating fatty acids and risk of hepatocellular carcinoma and
chronic liver disease mortality in the UK Biobank. Nature Communications,
15(1):3707, 2024.
[157] Wenyi Hu, Wei Wang, Huan Liao, Gabriella Bulloch, Xiayin Zhang, Xianwen Shang, Yu Huang, Yijun Hu, Honghua Yu, Xiaohong Yang, et al.
Metabolic profiling reveals circulating biomarkers associated with incident
and prevalent Parkinson’s disease. npj Parkinson’s Disease, 10(1):130, 2024.
[158] Shiyu Zhang, Zheng Wang, Yijing Wang, Yixiao Zhu, Qiao Zhou, Xingxing
Jian, Guihu Zhao, Jian Qiu, Kun Xia, Beisha Tang, et al. A metabolomic
profile of biological aging in 250,341 individuals from the UK Biobank.
Nature Communications, 15(1):8081, 2024.
[159] Eva S Istvan and Johann Deisenhofer. Structural mechanism for statin
inhibition of HMG-CoA reductase. Science, 292(5519):1160–1164, 2001.
[160] Peter Imming, Christian Sinning, and Achim Meyer. Drugs, their targets
and the nature and number of drug targets. Nature Reviews Drug Discovery,
5(10):821–834, 2006.
[161] Andrew Anighoro, Jurgen Bajorath, and Giulio Rastelli. Polypharmacology:
challenges and opportunities in drug discovery: miniperspective. Journal
of medicinal chemistry, 57(19):7874–7887, 2014.
[162] Grant R Zimmermann, Joseph Lehar, and Curtis T Keith. Multi-target
therapeutics: when the whole is greater than the sum of the parts. Drug
Discovery Today, 12(1-2):34–42, 2007.
[163] Zachary A Knight, Henry Lin, and Kevan M Shokat. Targeting the cancer
kinome through polypharmacology. Nature Reviews Cancer, 10(2):130–137,
2010.
[164] Pranita D Tamma, Sara E Cosgrove, and Lisa L Maragakis. Combination
therapy for treatment of infections with gram-negative bacteria. Clinical
microbiology reviews, 25(3):450–470, 2012.
[165] Roberta J Worthington and Christian Melander. Combination approaches
to combat multidrug-resistant bacteria. Trends in biotechnology, 31(3):177–
184, 2013.
[166] Tea Pemovska, Johannes W Bigenzahn, and Giulio Superti-Furga. Recent
advances in combinatorial drug screening and synergy scoring. Current
opinion in pharmacology, 42:102–110, 2018.
[167] Alan Sandler, Robert Gray, Michael C Perry, Julie Brahmer, Joan H Schiller,
Afshin Dowlati, Rogerio Lilenbaum, and David H Johnson. Paclitaxel–
carboplatin alone or with bevacizumab for non–small-cell lung cancer. New
England Journal of Medicine, 355(24):2542–2550, 2006.
[168] M Reck, J Von Pawel, P von Zatloukal, R Ramlau, V Gorbounova, V Hirsh,
N Leighl, J Mezger, V Archer, N Moore, et al. Overall survival with
cisplatin–gemcitabine and bevacizumab or placebo as first-line therapy
for nonsquamous non-small-cell lung cancer: results from a randomised
phase iii trial (AVAiL). Annals of Oncology, 21(9):1804–1809, 2010.
[169] James Larkin, Paolo A Ascierto, Brigitte Dréno, Victoria Atkinson,
Gabriella Liszkay, Michele Maio, Mario Mandalà, Lev Demidov, Daniil
Stroyakovskiy, Luc Thomas, et al. Combined vemurafenib and cobimetinib in BRAF-mutated melanoma. New England Journal of Medicine,
371(20):1867–1876, 2014.
[170] Joseph Lehár, Andrew S Krueger, William Avery, Adrian M Heilbut, Lisa M
Johansen, E Roydon Price, Richard J Rickles, Glenn F Short Iii, Jane E
Staunton, Xiaowei Jin, et al. Synergistic drug combinations tend to improve
therapeutically relevant selectivity. Nature Biotechnology, 27(7):659–666,
2009.
[171] Jonathan B Fitzgerald, Birgit Schoeberl, Ulrik B Nielsen, and Peter K
Sorger. Systems biology and combination therapy in the quest for clinical
efficacy. Nature chemical biology, 2(9):458–466, 2006.
[172] Sreenath V Sharma, Daniel A Haber, and Jeff Settleman. Cell line-based
platforms to evaluate the therapeutic efficacy of candidate anticancer
agents. Nature reviews cancer, 10(4):241–253, 2010.
[173] Jean-Pierre Gillet, Sudhir Varma, and Michael M Gottesman. The clinical
relevance of cancer cell lines. Journal of the National Cancer Institute,
105(7):452–458, 2013.
[174] Christiaan Klijn, Steffen Durinck, Eric W Stawiski, Peter M Haverty,
Zhaoshi Jiang, Hanbin Liu, Jeremiah Degenhardt, Oleg Mayba, Florian
Gnad, Jinfeng Liu, et al. A comprehensive transcriptional portrait of human
cancer cell lines. Nature Biotechnology, 33(3):306–312, 2015.
[175] MC Berenbaum. What is synergy? Pharmacological Reviews, 41:93–141, 1989.
[176] CI Bliss. The toxicity of poisons applied jointly. Annals of Applied Biology,
26(3):585–615, 1939.
[177] S Loewe. The problem of synergism and antagonism of combined drugs.
Arzneimittelforschung, 3:285–290, 1953.
[178] Seyed Ali Madani Tonekaboni, Laleh Soltan Ghoraie, Venkata Satya Kumar Manem, and Benjamin Haibe-Kains. Predictive approaches for drug
combination discovery in cancer. Briefings in Bioinformatics, 19(2):263–276,
2018.
[179] J Yang, H Tang, Y Li, R Zhong, T Wang, STC Wong, G Xiao, and Y Xie.
DIGRE: Drug-induced genomic residual effect model for successful prediction of multidrug effects. CPT: Pharmacometrics & Systems Pharmacology,
4(2):91–97, 2015.
[180] JH Lee, DG Kim, TJ Bae, K Rho, JT Kim, et al. CDA: Combinatorial drug
discovery using transcriptional response modules. PLoS ONE, 7(8), 2012.
[181] Ralph G Zinner, Brittany L Barrett, Elmira Popova, Paul Damien, Andrei Y
Volgin, Juri G Gelovani, Reuben Lotan, Hai T Tran, Claudio Pisano, Gordon B Mills, et al. Algorithmic guided screening of drug combinations of
arbitrary size for activity against cancer cells. Molecular cancer therapeutics, 8(3):521–532, 2009.
[182] Pak Kin Wong, Fuqu Yu, Arash Shahangian, Genhong Cheng, Ren Sun, and
Chih-Ming Ho. Closed-loop control of cellular functions using combinatory
drugs guided by a stochastic search algorithm. Proceedings of the National
Academy of Sciences, 105(13):5105–5110, 2008.
[183] Zikai Wu, Xing-Ming Zhao, and Luonan Chen. A systems biology approach
to identify effective cocktail drugs. In BMC Systems Biology, volume 4,
pages 1–14. Springer, 2010.
[184] Kelly E Regan-Fendt, Jielin Xu, Mallory DiVincenzo, Megan C Duggan,
Reena Shakya, Ryejung Na, William E Carson III, Philip RO Payne, and
Fuhai Li. Synergy from gene expression and network mining (syngenet)
method predicts synergistic drug combinations for diverse melanoma genomic subtypes. NPJ systems biology and applications, 5(1):6, 2019.
[185] Feixiong Cheng, István A Kovács, and Albert-László Barabási. Network-based prediction of drug combinations. Nature Communications, 10(1):1197,
2019.
[186] Bulat Zagidullin, Jehad Aldahdooh, Shuyu Zheng, Wenyu Wang, Yinyin
Wang, Joseph Saad, Alina Malyutina, Mohieddin Jafari, Ziaurrehman
Tanoli, Alberto Pessia, et al. DrugComb: an integrative cancer drug combination data portal. Nucleic acids research, 47(W1):W43–W51, 2019.
[187] Heewon Seo, Denis Tkachuk, Chantal Ho, Anthony Mammoliti, Aria Rezaie,
Seyed Ali Madani Tonekaboni, and Benjamin Haibe-Kains. SYNERGxDB:
an integrative pharmacogenomic portal to identify synergistic drug combinations for precision oncology. Nucleic acids research, 48(W1):W494–W501,
2020.
[188] Jennifer O’Neil, Yair Benita, Igor Feldman, Melissa Chenard, Brian Roberts,
Yaping Liu, Jing Li, Astrid Kral, Serguei Lejnine, Andrey Loboda, et al. An
unbiased oncology compound screen to identify novel combination strategies.
Molecular cancer therapeutics, 15(6):1155–1162, 2016.
[189] Yifan Sun, Yi Xiong, Qian Xu, and Dongqing Wei. A hadoop-based method to
predict potential effective drug combination. BioMed research international,
2014(1):196858, 2014.
[190] Hongyang Li, Tingyang Li, Daniel Quang, and Yuanfang Guan. Network propagation predicts drug synergy in cancers. Cancer Research,
78(18):5446–5457, 2018.
[191] Kaitlyn M Gayvert, Omar Aly, James Platt, Marcus W Bosenberg, David F
Stern, and Olivier Elemento. A computational approach for identifying
synergistic drug combinations. PLoS computational biology, 13(1):e1005308,
2017.
[192] Pavel Sidorov, Stefan Naulaerts, Jérémy Ariey-Bonnet, Eddy Pasquier, and
Pedro J Ballester. Predicting synergism of cancer drug combinations using
NCI-ALMANAC data. Frontiers in chemistry, 7:509, 2019.
[193] Remzi Celebi, Oliver Bear Don’t Walk IV, Rajiv Movva, Semih Alpsoy,
and Michel Dumontier. In-silico prediction of synergistic anti-cancer drug
combinations using multi-omics data. Scientific Reports, 9(1):8949, 2019.
[194] Jian-Yu Shi, Jia-Xin Li, Ke Gao, Peng Lei, and Siu-Ming Yiu. Predicting
combinative drug pairs towards realistic screening via integrating heterogeneous features. BMC Bioinformatics, 18:1–9, 2017.
[195] Fangfang Xia, Maulik Shukla, Thomas Brettin, Cristina Garcia-Cardona,
Judith Cohn, Jonathan E Allen, Sergei Maslov, Susan L Holbeck, James H
Doroshow, Yvonne A Evrard, et al. Predicting tumor cell line response to
drug pairs with deep learning. BMC bioinformatics, 19:71–79, 2018.
[196] Peiran Jiang, Shujun Huang, Zhenyuan Fu, Zexuan Sun, Ted M Lakowski,
and Pingzhao Hu. Deep graph embedding for prioritizing synergistic anticancer drug combinations. Computational and structural biotechnology
journal, 18:427–438, 2020.
[197] Yejin Kim, Shuyu Zheng, Jing Tang, Wenjin Jim Zheng, Zhao Li, and
Xiaoqian Jiang. Anticancer drug synergy prediction in understudied tissues
using transfer learning. Journal of the American Medical Informatics
Association, 28(1):42–51, 2021.
[198] Qiao Liu and Lei Xie. TranSynergy: Mechanism-driven interpretable deep
neural network for the synergistic prediction and pathway deconvolution of
drug combinations. PLoS computational biology, 17(2):e1008653, 2021.
[199] Halil Ibrahim Kuru, Oznur Tastan, and A Ercument Cicek. MatchMaker: a
deep learning framework for drug synergy prediction. IEEE/ACM transactions on computational biology and bioinformatics, 19(4):2334–2344, 2021.
[200] Jinxian Wang, Xuejun Liu, Siyuan Shen, Lei Deng, and Hui Liu. DeepDDS:
deep graph neural network with attention mechanism to predict synergistic
drug combinations. Briefings in Bioinformatics, 23(1):bbab390, 2022.
[201] Tianyu Zhang, Liwei Zhang, Philip RO Payne, and Fuhai Li. Synergistic
drug combination prediction by integrating multiomics data in deep learning models. Translational bioinformatics for therapeutic development, pages
223–238, 2021.
[202] Kristina Preuer, Richard PI Lewis, Sepp Hochreiter, Andreas Bender, Krishna C Bulusu, and Günter Klambauer. DeepSynergy: predicting anticancer drug synergy with deep learning. Bioinformatics, 34(9):1538–1546,
2018.
[203] Kunjie Fan, Lijun Cheng, and Lang Li. Artificial intelligence and machine learning methods in predicting anti-cancer drug combination effects.
Briefings in Bioinformatics, 22(6):bbab271, 2021.
[204] Delora Baptista, Pedro G Ferreira, and Miguel Rocha. Deep learning for
drug response prediction in cancer. Briefings in Bioinformatics, 22(1):360–
379, 2021.
[205] Yurui Chen and Louxin Zhang. How much can deep learning improve
prediction of the responses to drugs in cancer cell lines? Briefings in
Bioinformatics, 23(1):bbab378, 2022.
[206] Michael P Menden, Dennis Wang, Mike J Mason, Bence Szalai, Krishna C
Bulusu, Yuanfang Guan, Thomas Yu, Jaewoo Kang, Minji Jeon, Russ Wolfinger, et al. Community assessment to advance computational prediction of
cancer drug combinations in a pharmacogenomic screen. Nature Communications, 10(1):2674, 2019.
[207] Mukesh Bansal, Jichen Yang, Charles Karan, Michael P Menden, James C
Costello, Hao Tang, Guanghua Xiao, Yajuan Li, Jeffrey Allen, Rui Zhong,
et al. A community computational challenge to predict the activity of pairs
of compounds. Nature Biotechnology, 32(12):1213–1222, 2014.
[208] Julio Saez-Rodriguez, James C Costello, Stephen H Friend, Michael R
Kellen, Lara Mangravite, Pablo Meyer, Thea Norman, and Gustavo
Stolovitzky. Crowdsourcing biomedical research: leveraging communities
as innovation engines. Nature Reviews Genetics, 17(8):470–486, 2016.
[209] Abraham Wald. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American
Mathematical society, 54(3):426–482, 1943.
[210] David A Grimes and Kenneth F Schulz. Bias and causal associations in
observational research. The Lancet, 359(9302):248–252, 2002.
[211] Philip Sedgwick. Bias in observational study designs: prospective cohort
studies. BMJ, 349, 2014.
[212] Galit Shmueli. To explain or to predict? Statistical Science, 25(3):289–310,
2010.
[213] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation
for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
[214] Robert Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society Series B: Statistical Methodology,
58(1):267–288, 1996.
[215] Steffen Rendle. Factorization machines. In 2010 IEEE International conference on data mining, pages 995–1000. IEEE, 2010.
[216] Steffen Rendle, Zeno Gantner, Christoph Freudenthaler, and Lars Schmidt-Thieme. Fast context-aware recommendations with factorization machines.
In Proceedings of the 34th international ACM SIGIR conference on Research
and development in Information Retrieval, pages 635–644, 2011.
[217] Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. Field-aware factorization machines for CTR prediction. In Proceedings of the 10th
ACM conference on recommender systems, pages 43–50, 2016.
[218] Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan,
Yu Sun, and Quan Lu. Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018
world wide web conference, pages 1349–1357, 2018.
[219] Zhulin Tao, Xiang Wang, Xiangnan He, Xianglin Huang, and Tat-Seng Chua.
HoAFM: a high-order attentive factorization machine for ctr prediction.
Information Processing & Management, 57(6):102076, 2020.
[220] Giulia Grazia, Ilaria Penna, Valentina Perotti, Andrea Anichini, and Elena
Tassi. Towards combinatorial targeted therapy in melanoma: from preclinical evidence to clinical application. International journal of oncology,
45(3):929–949, 2014.
[221] Amila Suraweera, Kenneth J O’Byrne, and Derek J Richard. Combination
therapy with histone deacetylase inhibitors (HDACi) for the treatment of
cancer: achieving the full therapeutic potential of HDACi. Frontiers in
oncology, 8:92, 2018.
[222] Frederick K Ho, Carlos A Celis-Morales, Stuart R Gray, S Vittal Katikireddi,
Claire L Niedzwiedz, Claire Hastie, Lyn D Ferguson, Colin Berry, Daniel F
Mackay, Jason MR Gill, et al. Modifiable and non-modifiable risk factors
for covid-19, and comparison to risk factors for influenza and pneumonia: results from a UK Biobank prospective cohort study. BMJ Open,
10(11):e040402, 2020.
[223] Christy L Avery, Annie Green Howard, Harold H Lee, Carolina G Downie,
Moa P Lee, Sarah H Koenigsberg, Anna F Ballou, Michael H Preuss,
Laura M Raffield, Rina A Yarosh, et al. Branched chain amino acids harbor
distinct and often opposing effects on health and disease. Communications
Medicine, 3(1):172, 2023.
[224] Ying Yu, Yuanbang Mai, Yuanting Zheng, and Leming Shi. Assessing and
mitigating batch effects in large-scale omics studies. Genome Biology, 25(1),
2024.
[225] Bart JA Mertens. Transformation, normalization, and batch effect in the
analysis of mass spectrometry data for omics studies. Statistical Analysis of
Proteomics, Metabolomics, and Lipidomics Data Using Mass Spectrometry,
page 1, 2016.
[226] Dennis T Ko, Atul Sivaswamy, Maneesh Sud, Gynter Kotrri, Paymon Azizi,
Maria Koh, Peter C Austin, Douglas S Lee, Idan Roifman, George Thanassoulis, et al. Calibration and discrimination of the Framingham risk score
and the pooled cohort equations. Cmaj, 192(17):E442–E449, 2020.
[227] Ruth E Parsons, Xiaonan Liu, Jennifer A Collister, David A Clifton, Benjamin J Cairns, and Lei Clifton. Independent external validation of the
QRISK cardiovascular disease risk prediction model using UK Biobank.
Heart, 109(22):1690–1697, 2023.
[228] Charles George Broyden. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied
Mathematics, 6(1):76–90, 1970.
[229] Roger Fletcher. A new approach to variable metric algorithms. The Computer Journal, 13(3):317–322, 1970.
[230] D Goldfarb. A family of variable metric updates derived by variational
means, v. 24. Mathematics of Computation, pages 21–55, 1970.
[231] David F Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647–656, 1970.
[232] Anthony Letai. Functional precision cancer medicine—moving beyond pure
genomics. Nature Medicine, 23(9):1028–1035, 2017.
[233] Christoph Kornauth, Tea Pemovska, Gregory I Vladimer, Günther Bayer,
Michael Bergmann, Sandra Eder, Ruth Eichner, Martin Erl, Harald Esterbauer, Ruth Exner, et al. Functional precision medicine provides clinical
benefit in advanced aggressive hematologic cancers and identifies exceptional responders. Cancer discovery, 12(2):372–387, 2022.
[234] Jeffrey W Tyner, Cristina E Tognon, Daniel Bottomly, Beth Wilmot,
Stephen E Kurtz, Samantha L Savage, Nicola Long, Anna Reister Schultz,
Elie Traer, Melissa Abel, et al. Functional genomic landscape of acute
myeloid leukaemia. Nature, 562(7728):526–531, 2018.
[235] Tianduanyi Wang, Sandor Szedmak, Haishan Wang, Tero Aittokallio,
Tapio Pahikkala, Anna Cichonska, and Juho Rousu. Modeling drug
combination effects via latent tensor reconstruction. Bioinformatics,
37(Supplement_1):i93–i101, 2021.
[236] Abdulkadir Elmas, Kevin Spehar, Ron Do, Joseph M Castellano, and Kuanlin Huang. Associations of circulating biomarkers with disease risks: A two-sample Mendelian randomization study. International Journal of Molecular
Sciences, 25(13):7376, 2024.
[237] Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton,
John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray,
et al. UK biobank: an open access resource for identifying the causes of
a wide range of complex diseases of middle and old age. PLoS Medicine,
12(3):e1001779, 2015.
[238] Anna Fry, Thomas J Littlejohns, Cathie Sudlow, Nicola Doherty, Ligia
Adamska, Tim Sprosen, Rory Collins, and Naomi E Allen. Comparison of
sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology, 186(9):1026–1034, 2017.
Publication I
Heli Julkunen, Anna Cichońska, Prson Gautam, Sandor Szedmak, Jane
Douat, Tapio Pahikkala, Tero Aittokallio, Juho Rousu. Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination
effects. Nature Communications, December 2020.
© 2020 The Author(s). This is an open access article distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ARTICLE
https://doi.org/10.1038/s41467-020-19950-z
OPEN
Leveraging multi-way interactions for systematic
prediction of pre-clinical drug combination effects
Heli Julkunen 1, Anna Cichonska 1,2,3, Prson Gautam 3, Sandor Szedmak 1, Jane Douat 1,
Tapio Pahikkala 2, Tero Aittokallio 1,3,4,5,6 ✉ & Juho Rousu 1 ✉
We present comboFM, a machine learning framework for predicting the responses of drug
combinations in pre-clinical studies, such as those based on cell lines or patient-derived cells.
comboFM models the cell context-specific drug interactions through higher-order tensors,
and efficiently learns latent factors of the tensor using powerful factorization machines. The
approach enables comboFM to leverage information from previous experiments performed
on similar drugs and cells when predicting responses of new combinations in so far untested
cells; thereby, it achieves highly accurate predictions despite sparsely populated data tensors.
We demonstrate high predictive performance of comboFM in various prediction scenarios
using data from cancer cell line pharmacogenomic screens. Subsequent experimental validation of a set of previously untested drug combinations further supports the practical and
robust applicability of comboFM. For instance, we confirm a novel synergy between anaplastic lymphoma kinase (ALK) inhibitor crizotinib and proteasome inhibitor bortezomib in
lymphoma cells. Overall, our results demonstrate that comboFM provides an effective means
for systematic pre-screening of drug combinations to support precision oncology
applications.
1 Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland. 2 Department of Future Technologies,
University of Turku, Turku, Finland. 3 Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland. 4 Department of Mathematics
and Statistics, University of Turku, Turku, Finland. 5 Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway.
6 Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway. ✉email: tero.aittokallio@helsinki.fi; juho.rousu@aalto.fi
NATURE COMMUNICATIONS | (2020)11:6136 | https://doi.org/10.1038/s41467-020-19950-z | www.nature.com/naturecommunications
Combination therapies are often required for treating
cancer patients with advanced stages of the disease. In
addition to overcoming monotherapy resistance, combinatorial treatments can also reduce toxicity of the treatment
(by reduced doses of the drugs) and improve therapeutic efficacy (by multi-targeting effect)1–3. With recent advances in
high-throughput screening methods, a systematic evaluation of
combinations among large collections of chemical compounds
has become feasible. This typically leads to large-scale experiments, in which the combinatorial responses are tested in
various doses of the individual compounds, resulting in
dose–response matrices that capture the measured combination
effects for every concentration pair in a particular sample (e.g.,
cancer cell line or patient-derived cells)4. However, even with
modern high-throughput instruments, experimental screening
of drug combinations quickly becomes impractical, as the
number of conceivable drug combinations increases rapidly
with the number of drugs in consideration. In addition, the
inherent heterogeneity of cancer cells poses further challenges
for the experimental efforts, as the combinations need to be
tested in various cell contexts and genomic backgrounds5,6.
Therefore, computational methods are often being used to
guide the discovery of effective combinations to be prioritized
for further pre-clinical and clinical validation7,8.
In recent years, machine learning has emerged as a
powerful approach to aid the drug development process by
offering systematic means for the prediction of target bioactivities and drug-induced effects9–13, thereby providing guidance for drug discovery and repositioning efforts14,15. Until
recently, the performance of machine learning methods in
predicting drug combination effects was limited by the lack of
high-quality training data8. However, this is gradually changing
as increasing amounts of data from pre-clinical drug combination screens are becoming available, therefore creating new
opportunities also for the application of large-scale machine
learning methods4,5,16. For instance, the NCI-ALMANAC
dataset generated by the US National Cancer Institute (NCI)
provides over 3 million experimentally measured drug combination responses across various cell lines and tissue types4.
However, despite the potential value of such datasets, the high
dimensionality of the underlying dose–response data and the
inherent complexity of drug interaction patterns across various
doses pose challenges to accurate modeling of drug combination effects.
Several computational tools have been proposed for the prediction of drug combinations2,7,8,17. Many of these tools have
been systematically benchmarked in two crowdsourced DREAM
Challenge competitions18,19, which demonstrated that computational predictions can achieve high accuracies for selected drug
classes, provided there is enough drug information and training
data available. However, the focus of these challenges and most of
the previously proposed methods has been on directly predicting
drug combination synergies (i.e., whether the combined summary
effect is higher than expected). In many practical applications,
however, more detailed information on dose–response effects of
the combinations is required, rather than simply classifying the
summary effects into synergistic or antagonistic classes. Furthermore, as noted in the recent AstraZeneca-Sanger drug combination prediction DREAM challenge19, the performance of the
computational methods typically relies on selective incorporation
of target features and biological knowledge that is not always
available for all drugs and cell models. Therefore, there is a need
to develop integrative and robust models capable of generalizing
and learning from large amounts of available data that facilitate
the exploration of the extensive combinatorial drug and dose
spaces.
Here, we present comboFM, a novel machine learning framework for systematic modeling of drug-dose combination effects in
a cell context-specific manner. It is generally applicable to any
pre-clinical model systems, such as patient-derived primary cells,
but we demonstrate its performance here in cancer cell lines
(Fig. 1). We base our work on the observation that the drug
combination dose–response data can be compiled into a higher-order tensor indexed by drugs, drug concentrations, and cell lines.
comboFM then models the cell line-specific responses to a
combination of drugs as an interaction between the different
modes of the tensor using a higher-order factorization machine
(FM)20, a recently proposed machine learning approach for non-linear learning on large data. FMs have been shown to be compelling tools with the ability to work particularly well with high-dimensional and sparse datasets20–22. In contrast to existing
machine learning models, comboFM enables one to explore the
detailed landscape of drug combination responses across various
doses. We demonstrate that comboFM obtains high prediction
accuracy in various practical application scenarios, significantly
outperforming other approaches. Furthermore, we show the
robustness and practical potential of comboFM by experimentally
validating untested drug combinations predicted for specific cell
lines.
Results
Overview of comboFM model. comboFM was developed for
predicting drug combination responses of cancer cell lines in
three practical scenarios (Fig. 1a). The first scenario of predicting
new dose–response matrix entries corresponds to filling in the
gaps in partially measured dose–response matrices. In the second
scenario of new dose–response matrix inference, the predictions
are made for completely held out dose–response matrices of
untested drug–drug–cell line triplets, such that the drug pair has
still been observed in other cell lines. In the third and most
challenging scenario of new drug combination inference, the
predictions are made for completely new drug combinations with
no available combination measurements in any cell line, thereby
providing guidance on repositioning of the drugs for new combinations and cell contexts.
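The three scenarios differ mainly in which measurements must be held out together. The following is a minimal sketch of that idea, not the authors' code: the record layout, the scenario names and the helper functions `group_key` and `assign_folds` are assumptions introduced for illustration. Each measurement is assigned to a fold by the unit that has to stay intact: a single entry, a drug–drug–cell line triplet, or an unordered drug pair.

```python
import itertools
import random
from collections import defaultdict

def group_key(record, scenario):
    """Return the unit that must be kept inside a single fold (illustrative)."""
    drug1, drug2, cell, conc1, conc2 = record
    pair = tuple(sorted((drug1, drug2)))      # order-independent drug pair
    if scenario == "new_entries":             # single dose-response matrix entries
        return record
    if scenario == "new_matrices":            # whole drug-drug-cell line matrices
        return (pair, cell)
    if scenario == "new_combinations":        # drug pair unseen in every cell line
        return pair
    raise ValueError(f"unknown scenario: {scenario}")

def assign_folds(records, scenario, n_folds=5, seed=0):
    """Assign each record a fold index so that groups are never split."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[group_key(rec, scenario)].append(idx)
    keys = sorted(groups)                     # deterministic order before shuffling
    random.Random(seed).shuffle(keys)
    fold_of = [None] * len(records)
    for position, key in enumerate(keys):
        for idx in groups[key]:
            fold_of[idx] = position % n_folds
    return fold_of

# Toy data (hypothetical): 4 drugs, 2 cell lines, a 2 x 2 concentration grid.
drugs, cells, concs = ["A", "B", "C", "D"], ["c1", "c2"], [0.1, 1.0]
records = [
    (d1, d2, cell, x1, x2)
    for d1, d2 in itertools.combinations(drugs, 2)
    for cell in cells
    for x1 in concs
    for x2 in concs
]
folds_new_combinations = assign_folds(records, "new_combinations", n_folds=3)
```

Note that grouping by triplet does not by itself guarantee that the drug pair remains observed in other cell lines in the training folds; a full implementation would need to check this separately.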
To capture the high-order interactions between drug combinations in different cell lines and at various doses, comboFM models
the multi-way interactions between the two drugs, the cell lines
and the dose–response matrices as a fifth-order data tensor X
(Fig. 1b). Furthermore, comboFM makes it possible to integrate
any auxiliary data of the drugs and cell lines, such as chemical
descriptors in the form of molecular fingerprints of drug
compounds, gene expression profiles of the cancer cell lines and
concentration values tested for the drugs.
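As a rough illustration of how one entry of such a tensor could be turned into a feature vector, the sketch below one-hot encodes the five tensor modes and appends optional auxiliary descriptors. The vocabulary, the block order and the function names `one_hot` and `encode_entry` are assumptions made for this example, not the published implementation.

```python
def one_hot(value, vocabulary):
    """Binary indicator vector with a single 1 at the position of `value`."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the vocabulary")
    return [1.0 if item == value else 0.0 for item in vocabulary]

def encode_entry(drug1, drug2, cell, conc1, conc2, vocab, aux=None):
    """Encode one tensor entry as a flat feature row.

    Assumed block order: drug 1, drug 2, cell line, drug 1 concentration,
    drug 2 concentration, followed by optional real-valued descriptors.
    """
    row = (
        one_hot(drug1, vocab["drugs"])
        + one_hot(drug2, vocab["drugs"])
        + one_hot(cell, vocab["cells"])
        + one_hot(conc1, vocab["concs"])
        + one_hot(conc2, vocab["concs"])
    )
    if aux is not None:
        row += [float(v) for v in aux]        # e.g. fingerprint bits, expression values
    return row

# Hypothetical vocabulary for a tiny screen.
vocab = {
    "drugs": ["A", "B", "C"],
    "cells": ["c1", "c2"],
    "concs": [0.1, 1.0, 10.0],
}
row = encode_entry("A", "C", "c2", 0.1, 10.0, vocab, aux=[0.3, 0.9])
```

Every row has exactly five non-zero indicator entries, one per tensor mode, which is what makes the resulting feature matrix very sparse.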
For the learning algorithm, the data tensor X is flattened into a
two-dimensional array (Fig. 1c), where each row vector x
identifies a single entry in the original tensor. Given the
associated responses $y_i$ in the training data, the comboFM model is
learned using factorization machines (FMs). Higher-order FMs
learn a non-linear regression model from the input features ($x$) to
the output ($y$) by estimating a regression weight $w_{i_1,\ldots,i_t}$ for each
combination of input features $x_{i_1} x_{i_2} \cdots x_{i_t}$, where $t$ is the order
of the interaction. However, instead of estimating the weights
$w_{i_1,\ldots,i_t}$ separately as in polynomial regression, FMs approximate
the weights using factorized parametrization (Fig. 1d), where the
weights are coupled through multiplication of latent factors
learned by the FM. This approach avoids the computational and
statistical problems that would result from directly estimating the
weight tensor W. In addition, the coupling of the weights allows
effective learning in situations where the data tensor is sparsely
populated.
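The factorized parametrization can be made concrete with a small numerical sketch. It assumes, for illustration only, a single factor matrix P with d rows and k columns shared across interaction orders; actual higher-order FM implementations may keep separate factors per order and use faster recursions. The function names are hypothetical. The last function uses the well-known second-order identity that avoids enumerating feature pairs.

```python
from itertools import combinations
from math import prod

def factorized_weight(P, indices):
    """Weight of one feature combination: sum over s of the product of P[i][s]."""
    k = len(P[0])
    return sum(prod(P[i][s] for i in indices) for s in range(k))

def interaction_term_naive(P, x, order):
    """Sum of weight * feature product over all feature combinations of `order`."""
    total = 0.0
    for idx in combinations(range(len(x)), order):
        total += factorized_weight(P, idx) * prod(x[i] for i in idx)
    return total

def pairwise_term_fast(P, x):
    """Second-order term in O(d * k) time, without enumerating pairs."""
    total = 0.0
    for s in range(len(P[0])):
        linear = sum(P[i][s] * x[i] for i in range(len(x)))
        squares = sum((P[i][s] * x[i]) ** 2 for i in range(len(x)))
        total += 0.5 * (linear ** 2 - squares)
    return total

# Toy factor matrix with d = 3 features and rank k = 2 (made-up values).
P = [[1.0, 2.0],
     [0.5, -1.0],
     [2.0, 0.0]]
x = [1.0, 0.0, 1.0]

w_012 = factorized_weight(P, (0, 1, 2))          # 1*0.5*2 + 2*(-1)*0 = 1.0
second_order_naive = interaction_term_naive(P, x, 2)
second_order_fast = pairwise_term_fast(P, x)
```

In this toy case a full third-order weight tensor would hold d^3 = 27 entries, whereas the factor matrix holds only d × k = 6 parameters; weights of feature combinations that never co-occur in the training data are still defined through the shared factors.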
[Figure 1. Panels: a Prediction scenarios (predicting new dose–response matrix entries, new dose–response matrices, and new drug combinations; axes [D1], [D2] and %-growth); b Tensor representation; c Feature representation; d Parameter estimation.]
Fig. 1 Overview of the comboFM framework for the prediction of drug-dose combination effects. a Three prediction scenarios are considered: filling in
missing entries in partially tested dose–response matrices, predicting a complete dose–response matrix in a new cell line, and making predictions for a
completely new drug combination not tested so far in any cell line. b In each prediction scenario, the experimentally measured dose–response matrices are
compiled into a fifth-order tensor X indexed by drugs (D1, D2), drug concentrations ([D1], [D2]) and cell lines (C), and genomic and chemical descriptors are
integrated into the prediction model. c The structure of the tensor underlying the drug combination dose–response matrix data is one-hot encoded into a
single feature matrix together with the additional chemical and genomic descriptors. d The model parameters $w_{i_1,i_2,\ldots,i_t}$ for a $t$th-order combination ($t = 3$
depicted) of features $i_1, \ldots, i_t$ are approximated using the factorized parametrization $w_{i_1,i_2,\ldots,i_t} \approx \sum_{s=1}^{k} p_{i_1 s}\, p_{i_2 s} \cdots p_{i_t s}$ (see “Methods”). $d$ denotes the total number
of features and $k$ is a hyperparameter defining the rank of the factorization.
Accurate drug combination response predictions by comboFM.
To systematically evaluate the comboFM model, we used the
anticancer drug combination response data from the NCI-ALMANAC study4. To enable various splits of data into different cross-validation folds as required by the different prediction scenarios and to keep the computational complexity
manageable, we considered a subset of the data consisting of 50
unique FDA-approved drugs (Supplementary Table 3) in 617
distinct combinations screened in various concentration pairs
across all the 60 cell lines originating from 9 tissue types23. In
this data subset, a total of 333,180 drug combination response
measurements and 222,120 monotherapy response measurements of single drugs are available in the form of percentage
growth of the cell lines (see “Methods”). To computationally
quantify the performance of comboFM in predicting drug
combination responses and optimize the model parameters, we
performed a 10 × 5 (10 outer folds, 5 inner folds) nested cross-validation (CV) procedure under the three prediction scenarios
(see “Methods”). The order of the feature interactions modeled
by the FM was set to m = 5, according to the order of the
underlying tensor.
To investigate the benefit of considering higher-order feature
interactions, we also performed experiments using both second
order formulation of FMs and first order FMs (corresponding to
ridge regression). To further benchmark the predictive performance of comboFM, we applied random forest (RF) as a
reference model, a widely-used machine learning model that is
based on a rather different learning principle, and has previously
been used for modeling drug combination effects24–28, including
the winning method of the recent AstraZeneca-Sanger drug
combination prediction DREAM Challenge19. The cross-validation folds were held fixed throughout the experiments to
ensure a fair comparison. We assessed the predictive performance
of the methods using root mean squared error (RMSE), as well as
Pearson and Spearman correlation between original and predicted dose–response matrices.
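The three evaluation measures named above can be sketched in plain Python as follows. This is an illustrative sketch rather than the evaluation code used in the study; the function names are hypothetical, and the inputs are assumed to be flattened lists of measured and predicted %-growth values.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted responses."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Spearman correlation depends only on the ordering of the responses, so it complements RMSE and Pearson correlation, which are sensitive to the scale of the errors.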
By leveraging the multi-way interactions present in the
underlying high-dimensional drug combination space across
drugs, drug concentrations, and cancer cell lines, the 5th order
comboFM demonstrated high predictive accuracy in all the three
prediction scenarios (Fig. 2), outperforming the random forest
reference (p < 10−10 in all prediction scenarios, two-sided
paired Wilcoxon signed-rank test, N = 666,360). In the
scenarios of predicting new dose–response matrix entries and
new dose–response matrices, the 5th order comboFM obtained a
Pearson correlation of 0.97, and even in the new drug
combination prediction scenario, the 5th order comboFM
obtained a Pearson correlation of 0.95. The 5th order comboFM
was also markedly more accurate than both the 1st- and 2nd
order comboFMs in all the three scenarios. Similar relative
performance of the methods was also observed using Spearman
correlation and RMSE (Fig. 2). In addition, the distribution of the
predictions by 5th order comboFM followed that of the measured
responses most accurately (Supplementary Fig. 1).
In addition to the global predictive performance of the
methods, we also analyzed their performance in different tissue
types and across the various types of drug combination therapies
(Fig. 3 and Supplementary Figs. 2–4 and Supplementary Table 1).
In all the three prediction scenarios (Fig. 3a–c), comboFM
showed the highest average prediction accuracy in each of the
tissue types, and also the smallest variance across the tissue types.
The combination response in colon cancer appeared marginally
more difficult to predict than the other tissue types, which is likely
explained by higher variation in the colon cancer response data,
as the number of colon cancer cell lines was similar to the other
tissue types and thus the marginally inferior performance is
unlikely to stem from limited data quantity. Nevertheless, the 5th
order comboFM was still the most accurate method also in colon
cancer cell lines. Furthermore, comboFM was shown to provide
high accuracies across various types of combination therapies
(chemotherapies, targeted therapies, and other therapies, such as
hormonal therapies) (Fig. 3d–f). The combination therapies
involving drugs from the Other class include the smallest number
of observations, explaining their reduced predictive accuracy with
all the methods.
To further validate the performance of the 5th order
comboFM, we also evaluated its predictive accuracy in the
remaining part of the NCI-ALMANAC data that was not used in
the cross-validation, consisting of 4737 distinct drug combinations. The model was trained on the full development dataset of
617 drug combinations as well as the monotherapy responses of
the single drugs in the validation set, and the trained model was
then used for predicting responses of the 4737 drug combinations
in the validation set across the various cell lines. 5th order
comboFM demonstrated high predictive accuracy also in this
validation set (Supplementary Figs. 5 and 6), with Pearson
correlation of 0.91 even for combinations where neither drug had
previously been observed in any other combination, i.e. only the
monotherapy responses of the individual drugs in the combination were available to the model.
Synergy scores can be recovered with high accuracy based on
the predicted dose–response matrices. As the interest in drug
combination experiments often lies in discovering the most
synergistic drug combinations, we also quantified drug combination synergies based on the dose–response matrices predicted
with comboFM. As a synergy quantification model, we applied
NCI ComboScore (see “Methods”)4, computed over the complete
predicted dose–response matrices. Although drug combinations
with an NCI ComboScore above zero are technically defined to be
synergistic, combinations with highly synergistic effects are
typically considered as more attractive candidates for further
experimental validation. Therefore, we labeled the extreme
synergistic drug combinations (observed NCI ComboScore value
in the top 10%) as the positive class and the remaining combinations, including lowly synergistic, additive, and antagonistic
combinations, as the negative class.
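The labeling and discrimination analysis described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, observed scores define the classes, predicted scores are used for ranking, and the AUC is computed by direct pairwise comparison, which is only practical for small inputs.

```python
def label_top_fraction(observed_scores, fraction=0.10):
    """Label scores in the top `fraction` as 1 (positive), the rest as 0.

    Ties at the threshold are all labeled positive, so slightly more than
    `fraction` of the items may end up in the positive class.
    """
    n_pos = max(1, round(len(observed_scores) * fraction))
    threshold = sorted(observed_scores, reverse=True)[n_pos - 1]
    return [1 if s >= threshold else 0 for s in observed_scores]


def roc_auc(labels, predicted_scores):
    """AUC as the probability that a random positive outranks a random negative.

    Tied pairs count as one half. Runs in O(n_pos * n_neg) time.
    """
    pos = [s for l, s in zip(labels, predicted_scores) if l == 1]
    neg = [s for l, s in zip(labels, predicted_scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Changing `fraction` corresponds to varying the top-% cut-off used to define extreme synergy.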
Drug combination synergy scores were recovered with a high
accuracy from the dose–response matrices predicted by the 5th
order comboFM in all three prediction scenarios, significantly
outperforming the other compared methods (Supplementary
Fig. 7). Importantly, the drug combination synergies could be
accurately computed based on the predicted dose–response
matrices using 5th order comboFM even in the challenging
scenario of predicting new drug combinations, with a Pearson
correlation of 0.72 (p < 10−10, two-sided t-test, N = 74,040)
between the observed and predicted NCI ComboScores. In the
task of discriminating highly synergistic drug combinations, the
5th order comboFM obtained a high area under the receiver
operating characteristic curve (AUC) of 0.91 in the new drug
combination prediction task (Supplementary Fig. 8). The
discrimination accuracies were at a high level in each prediction
scenario, and when using various top-% extreme synergy
combinations (Supplementary Fig. 8).
Experimental validation of the most synergistic predicted drug
combinations. To further demonstrate the ability of comboFM
to predict novel and robust drug combinations, the model was
trained using all the available dose–response measurements in
the development dataset, and the trained comboFM was then
used to predict dose–response matrices for the remaining
[Figure 2: scatter plots of predicted response (%-growth) against measured response (%-growth) for comboFM-5, comboFM-2, comboFM-1 and RF. Panels: a Predicting new dose–response matrix entries; b Predicting new dose–response matrices; c Predicting new drug combinations. Each plot is annotated with RMSE, RPearson, RSpearman, the Pearson correlation of the NCI ComboScores, and the trend-line equation.]
Fig. 2 Predictive performance of 5th (comboFM-5), 2nd (comboFM-2) and 1st order comboFM (comboFM-1), and random forest (RF) as scatter plots
between the measured and predicted dose–response matrices. The responses were measured by percentage growth in the three prediction scenarios:
a new dose–response matrix entries, b new dose–response matrices, and c new drug combinations. Root mean squared error (RMSE), Pearson correlation
(RPearson) and Spearman correlation (RSpearman) for the drug combination response prediction are reported as averages over 10 outer CV folds. The
Pearson correlation of the NCI ComboScores is reported as an average over all computed NCI ComboScores, computed based on the predicted
dose–response matrices. Trend line and its equation are shown for each scatter plot.
unmeasured drug combinations across all the 60 cell lines,
which resulted in a total of 10,320 predicted complete
dose–response matrices. Experimental validation was performed subsequently on a subset of 16 drug combinations
specific for 4 cell lines (Supplementary Table 2), where high
synergy was predicted by comboFM. These combinations were
selected to mainly involve molecularly targeted therapies, as the
recent interest has increasingly evolved toward targeted agents
over the standard cytotoxic chemotherapies. In particular, we
focused on cancer-specific drug combinations which were
predicted to have highly synergistic effects only in a subset of all
the cell lines and tissue types. This poses a more challenging
task than identifying broadly toxic combinations that kill most
cancer cells, but which may also induce severe toxicities in the
healthy cells. As in the previous experiments, we considered as
highly synergistic those combinations with observed NCI
ComboScore values in the top 10% in a particular tissue type.
The results of the experimental validation of 16 drug–drug–cell
line triplets are summarised in Fig. 4, using the Bliss model to
quantify the observed synergy. The background histogram shows
a distribution of an in-house drug combination dataset, consisting
of 60 drug combinations tested against 16 KRAS-mutant
pancreatic ductal adenocarcinoma cell lines. Since the combinations in the reference set were not randomly selected, the
background synergy distribution shows a slight positive bias;
however, since the assay was the same as the one used for the
experimental validation of comboFM predictions (“Methods”), it
is expected to provide a valid reference distribution for statistical
evaluations. All the drug combinations predicted by comboFM
were validated as synergistic, when considering positive Bliss
score as evidence for a degree of synergy (p < 10−4, binomial test
against the background distribution). Importantly, 9 out of 16
combinations had a Bliss synergy score higher than 90% of the
background distribution (p < 10−5, binomial test). In addition to
Bliss synergy score, we also computed the synergy scores using
three other popular synergy models: Loewe, highest single agent
(HSA) and zero-interaction potency (ZIP) scores (Supplementary
Figs. 9 and 10). These results demonstrate the robustness of the
comboFM predictions across various experimental setups and
synergy scoring models.
[Figure 3: boxplots of Pearson correlation for comboFM-5, comboFM-2, comboFM-1 and RF. Panels a–c are grouped by tissue type (Breast, CNS, Colon, Haematological, Melanoma, NSC Lung, Ovarian, Prostate, Renal); panels d–f are grouped by drug classes (Chemotherapy - Chemotherapy, Chemotherapy - Other, Other - Other, Targeted - Chemotherapy, Targeted - Other, Targeted - Targeted). Rows: a, d Predicting new dose-response matrix entries; b, e Predicting new dose-response matrices; c, f Predicting new drug combinations.]
Fig. 3 Predictive performance of 5th (comboFM-5), 2nd (comboFM-2) and 1st order comboFM (comboFM-1), and random forest (RF) across tissue
types and drug classes in the three prediction scenarios. a–c tissue types. d–f drug classes. The three prediction scenarios are depicted as follows:
a, d predicting new dose–response matrix entries, b, e predicting new dose–response matrices, and c, f predicting new drug combinations. Further
information on the drug classes can be found in Supplementary Table 3. In the boxplots, the horizontal lines drawn in the middle denote the median, and
the lower and upper hinges correspond to the 25th and 75th percentiles, respectively. The upper and lower whiskers denote the largest and smallest
values, respectively, no further than 1.5 times the inter-quartile range (IQR). The points that are not included between the whiskers are outlier predictions.
[Figure 4: Experimentally tested combinations selected based on comboFM predictions, shown on top of a histogram of the reference dataset. x-axis: Bliss synergy score; y-axis: Number of combinations (reference dataset). Dashed lines mark the 50%, 75%, 90% and 99% percentiles. Tested drug–drug–cell line triplets:
Everolimus - Romidepsin - MALME-3M
Erlotinib - Axitinib - IGROV1
Oxaliplatin - Romidepsin - MALME-3M
Romidepsin - Everolimus - HS578T
Romidepsin - Vismodegib - MALME-3M
Teniposide - Lenalidomide - SR
Gefitinib - Vismodegib - MALME-3M
Lomustine - Gefitinib - IGROV1
Lomustine - Gefitinib - SR
Dactinomycin - Romidepsin - MALME-3M
Fulvestrant - Crizotinib - SR
Vandetanib - Exemestane - IGROV1
Cladribine - Romidepsin - MALME-3M
Crizotinib - Bortezomib - SR
Thioguanine - Megestrol acetate - SR
Vismodegib - Romidepsin - HS578T]
Fig. 4 Measured drug combination synergy scores in the experimental validation. In-house experimental validation of 16 selected predictions in specific
cell lines are shown as colored lines (on top), and the histogram shows a background distribution from an in-house reference dataset that comprises 60
drug combinations tested against 16 KRAS-mutant pancreatic ductal adenocarcinoma cell lines (see “Methods”). The synergy was quantified using Bliss
independence score for the most synergistic area of the dose–response matrix (see Supplementary Fig. 9 for other synergy scores). The color scale
corresponds to the Bliss scores (green—antagonistic response, white—independent response, red—synergistic response). Dashed lines denote the
percentiles of the background distribution obtained using the same experimental setup.
Among others, comboFM predicted a particularly high level of
synergy for the combination between anaplastic lymphoma
kinase (ALK) inhibitor crizotinib and proteasome inhibitor
bortezomib in lymphoma cell line SR. In addition to our in-house experimental validations, this finding was further validated
in external measurements in the NCI-ALMANAC data that were
not used as part of comboFM training data. The ALK inhibitors
are effective against cancers harboring ALK fusions. The SR cell
line carries the NPM1-ALK fusion, which is the first ever
discovered ALK fusion in large-cell lymphoma29. Bortezomib is
approved for mantle cell lymphoma, supporting its potential in
lymphoma treatment. It is likely that two even mildly effective
inhibitors when used in combination may enhance the inhibition
effect and potentially overcome monotherapy resistance. Notably,
comboFM made this prediction without knowledge of the ALK
fusion status of the SR cell line, i.e., this biological rationale was
not available for the model. The prediction of high synergy
between the first-generation inhibitors of ALK and proteasome
for lymphoma cell lines highlights the potential of comboFM to
predict biologically plausible combination effects.
The comboFM model also identified another unique drug
combination effective against the SR cell line, the combination of
EGFR inhibitor gefitinib with an approved chemotherapy
lomustine for lymphoma treatment. One of the mechanisms
inducing resistance to ALK inhibitors is activation of EGFR, as
they signal through similar downstream pathways. Brigatinib, a
dual ALK/EGFR inhibitor, is therefore being explored in clinical
settings against lymphoma and lung cancer patients
(NCT01449461). Our comboFM method predicted combination
partners to extensively explored ALK and EGFR inhibitors for
lymphoma, which we were able to also validate in the
experimental setting (Fig. 4). These examples show the potential
of comboFM to identify novel combinations of both targeted and
cytotoxic treatments, that individually are already used as
lymphoma treatments, and therefore are likely to have acceptable
toxicity profiles in clinical applications.
Discussion
Given the enormous number of conceivable drug and dose
combinations, computational approaches are needed to accelerate
the experimental work by providing guidance toward identifying
the most promising drug combinations for further experimental
validation. While large datasets of drug combination
dose–response matrices have already been tested in the lab,
extensive gaps still remain in the combinatorial space among both
targeted and non-targeted therapies, as well as hormonal and
immunotherapies. Here, we have presented a novel machine
learning framework, comboFM, for large-scale systematic prediction of drug combination effects in human cancer cell lines.
The obtained results demonstrate that comboFM can leverage
predictive higher-order relationships between drugs, drug concentrations, and cancer cell line responses, which were missed
when using random forest and simpler approaches, including 1st
and 2nd order formulation of comboFM. Importantly, comboFM
can accurately generalize the predictions also for new drug
combinations not observed in the training space, which enables
one to systematically predict dose–response matrices also for so
far untested drug combinations formed by the individual drugs in
the training set. This will provide guidance on repositioning the
drugs into new combinations. We also demonstrated that comboFM consistently obtains high prediction performance across
various tissue types and classes of drug combination therapy. In
addition, 5th order comboFM was 3 times faster to train compared to the random forest reference when run on the same CPU
and considering a relatively conservative number of 200 training
epochs for training the comboFM model (Supplementary
Table 2). Further performance advantages were obtained by
employing a GPU for training the 5th order comboFM model (34
times faster compared to random forest).
Modeling the drug combination effects first at the level of
dose–response matrices and subsequently quantifying the level of
overall drug combination synergy over the full matrix provides
many benefits compared to approaches that directly aim at predicting the drug combination synergies. First of all, predicting the
underlying dose–response matrices enables one to leverage all the
information contained in the dose–response matrices and provides detailed information of the response landscape across various dose combinations. In addition, in the second stage, one is
not limited only by a single synergy quantification model, but can
explore the synergies using various models, hence gaining a more
comprehensive view of the synergistic drug combination
landscapes30. Furthermore, understanding the drug combination
effects both at the dose level as well as at the synergy level provides useful guidance for precision medicine efforts. For instance,
combination synergies observed at lower doses are often better
tolerated in the clinical practice. Furthermore, it has been shown
that for most of the FDA-approved drug combinations, only little
evidence of additivity or synergy was observed in pre-clinical
models31, highlighting that synergy is not always needed for
clinical treatment success. However, it has also been argued that
patient stratification based on predictive markers is likely to
reduce variability in clinical therapy responses, and contribute to
achieving truly synergistic responses to combination
treatments32.
In-house experimental validations of the top-synergistic combinations predicted using the NCI-ALMANAC data demonstrated
that the comboFM predictions are robust also to the experimental
setup. The in-house assay had many experimental differences when
compared to the combination assay used to profile the NCI-ALMANAC development dataset. In particular, the in-house assay
measured the drug combination responses in the form of percentage inhibition, instead of percentage growth that is used in the
NCI-ALMANAC assay. Therefore, we could not calculate the NCI
ComboScore for the experimental validations, but instead scored
the combinations using four popular synergy models (Supplementary Figs. 9 and 10). As an example, comboFM predicted a
pivotal role of histone deacetylase (HDAC) in melanoma cell line
MALME-3M, thereby suggesting potential of HDAC inhibition
against melanoma. In particular, various combinations with HDAC
inhibitor romidepsin were predicted to be effective against BRAF-mutant melanoma cell line MALME-3M, which also held true in
the experimental settings (Fig. 4). Even though most of the drugs in
the romidepsin-combinations have already been explored in different combinations to target melanoma33,34, the combinations
predicted by comboFM have remained unexplored against melanoma, and warrant further investigation. Individually, each of these
inhibitors has shown promising results in pre-clinical or clinical
settings against melanoma, further supporting their use in combination therapies.
Even though the main objective of this work was to develop
and carefully validate the comboFM model in cancer cell lines as
an accurate methodology for systematic prediction of drug
combination responses for biological discovery, we note that
many of the drugs identified by comboFM have been or are
currently being explored in clinical settings against the specific
cancer type, either as single agents or in combination with other
drugs (see Supplementary Table 5). For instance, HDAC inhibitor
vorinostat is being tested against BRAF-mutant advanced melanoma in an ongoing clinical trial (ref. 35; NCT02836548). Similarly, the mTOR inhibitor everolimus has been shown to selectively target
BRAF-mutant melanoma under acidic conditions36. In an ongoing
clinical trial, the mTOR inhibitors everolimus or temsirolimus in
combination with a BRAF inhibitor are being investigated against
BRAF-mutant advanced solid tumors (NCT01596140). The SMO inhibitor vismodegib blocks the Hedgehog pathway, which regulates
skin growth. In the case of medulloblastoma, HDAC inhibitors
are active against even SMO-inhibitor resistant cell lines37.
Hence, concurrent use of HDAC and SMO inhibitors is a
promising strategy to target melanoma, as predicted for the romidepsin and vismodegib combination (Fig. 4). Along the same line of
rationale, combining an HDAC inhibitor with DNA-damaging
agents, such as oxaliplatin, dactinomycin, and cladribine, holds
strong promise and is being explored in different pre-clinical and
clinical settings33,34,38,39.
These case examples already unveil the potential of our method
for predicting combinations with translational potential, although
these findings warrant further validation in proper clinical trials.
Furthermore, once the model accuracy has been confirmed in the
cell line resources, we envision that the carefully validated model
will be applicable also to data from individual cancer patients,
thereby providing means for tailoring effective combinations in
precision oncology applications. For selected cancer types, such as
haematological malignancies, molecular and drug response profiling data are becoming available from patient-derived primary
cells that can be used for training cancer type-specific prediction
models40,41. Once similar data from other cancer types becomes
available, comboFM will enable also pan-cancer analyses, similar
to the current analyses in the NCI-ALMANAC cell lines. We
found that many of the combinations predicted in the NCI-ALMANAC cell lines have actually already been tested in clinical
trials (Supplementary Table 5). Interestingly, most of the combinations are tested in different indications than what was predicted based on the cell lines, suggesting further drug repurposing
opportunities. The comboFM predictions require input data that
are becoming routinely available in many functional precision
medicine studies, making the method broadly applicable to many
cancer types and therapy classes.
In the present study, we assumed that one knows the
monotherapy responses of single drugs prior to predicting the
combination responses, as in practice it is often needed to know
the concentration ranges and potencies of the single drugs (i.e.,
dose–response curves) in order to know which dose combinations should be used in combination testing, and also how
potent the compounds are individually. comboFM strongly
benefits from this information due to its capability to interpolate in the space of dose–response matrices through the
computation of latent factors representing similarly behaving
drug combinations from the response tensor alone (similarly to
recommender systems grouping users by the movies they have
liked in the past), while the drug and cell line descriptors
merely fine-tune the predictions. It is plausible that by careful
experimental design, one could minimize the number of
monotherapy responses needed for accurate dose–response
matrix prediction42 whilst maintaining the accuracy of the
comboFM model, which we leave as an interesting future
research topic. However, in a scenario where one would like to
perform predictions for completely new molecules with no
prior monotherapy or combination response data in any cell
line, the computed latent factors are no longer helpful, and
none of the methods could perform well with the current design
(Supplementary Fig. 13). This limitation of the methodology in
such scenarios could potentially be addressed by more extensive
feature engineering or by developing models that are specialized
for the case of predicting dose–response matrices for combinations of completely new drugs.
As with any high-throughput pre-clinical data, the cell line
drug response profiles may show inconsistency in experimental
outputs across the same cell line-treatment pairs43. Therefore, we
argue that it is important to develop and initially evaluate the
prediction models in large enough and standardized cell line
resources, such as NCI-ALMANAC, to avoid any reproducibility
issues in the development phase. We further tested the model
predictions using distinct experimental setups in the same cell
lines to show that the predictions were robust enough against
such biological and technical variability.
In conclusion, given the high cost of the experimental
screening of drug combinations, comboFM has the potential to
provide time- and cost-effective means toward prioritizing the
most promising drug combinations for further pre-clinical or
clinical studies. The accurate and robust drug combination
response predictions provide a promising approach to streamline
the development and expansion of combination therapeutics in
personalized cancer treatment. This could ultimately accelerate
the clinical use of combination therapeutics to combat acquired
drug resistance and to increase therapeutic efficacies.
and q:

$$y(A, B) = \sum_{p,q} \left( y_c(A_p, B_q) - y_e(A_p, B_q) \right) \qquad (6)$$

where $y_c(A_p, B_q)$ is the combination growth fraction of the cell line exposed to
drug A in concentration p and drug B in concentration q, and $y_e(A_p, B_q)$ is the
expected growth fraction for the combination defined based on the monotherapy
effects of drug A and drug B as follows:

$$y_e(A_p, B_q) = \begin{cases} \min\left(y_m(A_p), y_m(B_q)\right) & \text{if } y_m(A_p) \le 0 \text{ or } y_m(B_q) \le 0 \\ \frac{1}{150}\, \tilde{y}_m(A_p)\, \tilde{y}_m(B_q) & \text{otherwise} \end{cases} \qquad (7)$$

where $y_m(A_p)$ and $y_m(B_q)$ denote the monotherapy effects of drug A in
concentration p and drug B in concentration q, respectively. We applied $\tilde{y}_m = \min(y_m, 150)$ that truncates the growth fraction at 150, with the threshold
selected based on the histogram of the measured drug combination responses
(Supplementary Fig. 11).

Methods
Higher-order factorization machines. comboFM uses higher-order factorization
machines (HOFM)20,21 for predicting the drug–drug combination responses.
HOFMs are non-linear regression models learned with a training set of examples
$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$
of feature vectors $x \in \mathbb{R}^d$ and output labels $y \in \mathbb{R}$.
A trained HOFM models the output $y \in \mathbb{R}$ as a function of single, pairwise, and
higher-order interactions between input features up to order m:

$$\hat{y}(x) := \sum_{i=1}^{d} w_i x_i + \sum_{1 \le i < i' \le d} w_{i,i'} x_i x_{i'} + \ldots + \sum_{1 \le i_1 < \ldots < i_m \le d} w_{i_1,i_2,\ldots,i_m} x_{i_1} x_{i_2} \cdots x_{i_m} \qquad (1)$$

The first term corresponds to a linear model, and all parameters $w_i$ are
independently estimated. The higher-order parameters are, on the other hand,
estimated in a factorized form

$$w_{i,i'} = \langle p_i^{(2)}, p_{i'}^{(2)} \rangle \qquad (2)$$

$$w_{i_1,i_2,\ldots,i_t} = \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, \ldots, p_{i_t}^{(t)} \rangle, \quad t = 3, \ldots, m \qquad (3)$$

where $p_i^{(m)} \in \mathbb{R}^k$ denotes the mth order factor weight of feature i, k is the
hyperparameter defining the rank of the factorization, and

$$\langle a_1, a_2, \ldots, a_m \rangle = \sum_{s=1}^{k} a_{1s} a_{2s} \cdots a_{ms} \qquad (4)$$

denotes a generalized inner product of m vectors $a_i \in \mathbb{R}^k$, $i = 1, \ldots, m$, that
generalizes the usual pairwise inner product ⟨a, b⟩ = aTb to sets of m vectors.
The factor weights are collected into matrices $P^{(m)} = (p_1^{(m)}, \ldots, p_d^{(m)})^T \in \mathbb{R}^{d \times k}$.
The factorized parametrization drastically reduces the number of estimated
parameters from $O(d^m)$ (all feature combinations have their own parameter) to
$O(kdm)$ (m − 1 factor matrices of dimension d × k). In principle HOFMs allow a
unique rank $k_t$ for each order t = 2, ..., m. In the above description and in our
experiments, we used uniform rank $k = k_2 = \ldots = k_m$.
FMs are based on the assumption that the effect of pairwise and higher-order
feature interactions has a low rank and allows FMs to estimate reliable parameters
even under highly sparse data. Hence, the co-occurrence of xi and xi0 does not need
to be observed in order to learn wi;i0 : the factors pi0 : and pi0 : can be learned by
interacting with other dimensions and the dot product of pi0 : and pi0 : still gives wi;i0 .
This is extremely useful in the case of high-dimensional drug combination data
where the input tensor is typically very sparse, and thus allows to make reliable
inferences of the responses to new drug combinations whose individual
components have still been observed in other combinations elsewhere in the
training tensor. Compared to standard matrix factorization approaches, FMs
provide additional flexibility by allowing integration of auxiliary data describing the
drugs and cell lines, such as chemical and genomic descriptors.
The objective function of learning higher-order factorization machines is to
minimize the regularized mean squared error
min
n m
X
2 β
1X
βt ðtÞ 2
y y^i ðxi Þ þ 1 jjwjj2 þ
jjP jj
n i¼1 i
2
2
t¼2
ð5Þ
where β1, . . . , βm > 0 are regularization parameters. To limit the number of
hyperparameter combinations to search, following the work by Blondel et al.20, we
set β1 = . . . = βm, and a uniform rank k = k2 = . . . = km. In the experiments, we
used a recent TensorFlow implementation of higher-order factorization
machines44.
On the NCI-ALMANAC data, increasing the order and rank of the
factorization machine both improve the predictive performance (Pearson
correlation) of the comboFM model (Supplementary Fig. 12). The predictive
performance increases steeply until order 5, which matches the intrinsic order of
the data tensor X (See Fig. 1b), and then continues to increase more slowly. The
performance increase due to increasing rank of the factorization is rapid until
around rank 50 and then continues to increase more slowly. There is no apparent
overfitting even with factorization order as high as 10 and rank as high as 150.
Synergy quantification. As the interest often lies in discovering the most synergistic drug combinations, we quantify the drug combination synergies based on the
predicted dose–response matrices. To compute the synergy scores, we apply the
NCI ComboScore, which was introduced along with the NCI-ALMANAC dataset4,
originally modified from the Bliss independence score.
The NCI ComboScore for drug A and drug B is defined as the sum of the
deviations between expected and observed responses over all concentrations p
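The ComboScore described in this section can be illustrated with a short sketch. It is a minimal reading of the definition, not the authors' implementation: the function names, the toy growth percentages, and the use of the same constant 150 both as truncation threshold and as scaling factor are assumptions taken from the printed formula.

```python
from typing import Sequence


def expected_growth(ym_a: float, ym_b: float, cap: float = 150.0) -> float:
    """Expected combination growth from two monotherapy growth percentages."""
    if ym_a <= 0 or ym_b <= 0:
        # At least one drug alone already stops or reverses growth.
        return min(ym_a, ym_b)
    # Truncate each monotherapy response at `cap`, then scale the product.
    return min(ym_a, cap) * min(ym_b, cap) / cap


def combo_score(
    combo: Sequence[Sequence[float]],
    mono_a: Sequence[float],
    mono_b: Sequence[float],
    cap: float = 150.0,
) -> float:
    """Sum of observed-minus-expected growth over all concentration pairs."""
    return sum(
        combo[p][q] - expected_growth(ym_a, ym_b, cap)
        for p, ym_a in enumerate(mono_a)
        for q, ym_b in enumerate(mono_b)
    )


# Toy data (hypothetical values): two concentrations per drug.
mono_a = [75.0, -20.0]                    # drug A alone at concentrations p = 0, 1
mono_b = [150.0, 30.0]                    # drug B alone at concentrations q = 0, 1
combo = [[60.0, 10.0], [-25.0, -20.0]]    # combo[p][q]: observed combination growth
score = combo_score(combo, mono_a, mono_b)
```

Keeping the truncation threshold as a parameter makes it easy to check how sensitive a score is to that choice.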
Training setup. In order to evaluate the predictive performance and optimize the
model parameters under the three prediction scenarios, we performed a 10 × 5 (10
outer folds, 5 inner folds) nested cross-validation procedure. For all the factorization machine models, the rank parameter was optimized in the range k =
{25, 50, 75, 100} and the regularization parameter in the range β =
$\{10^2, 10^3, 10^4, 10^5\}$. The order of the modeled feature interaction was set to 5
according to the order of the underlying tensor, as a compromise between the
training time and prediction accuracy. The learning rate was set to 0.001 based on
preliminary experiments and other parameters were kept at their default values.
The number of trees of the random forest model was optimized in the range
{32, 64, 128, 512} and the fraction of features considered when looking for the best
split (MaxFeatures) in the range {0.25, 0.5, 0.75, 1.0}.
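The structure of the 10 × 5 nested cross-validation over the factorization machine grid can be sketched as follows. This only enumerates the splits and hyperparameter pairs; model fitting and scoring are omitted, and the function names, the seed, and the way indices are dealt into folds are illustrative assumptions rather than the procedure used in the study.

```python
import itertools
import random


def k_fold_indices(n_items, n_folds, rng):
    """Shuffle item indices and deal them into n_folds disjoint folds."""
    order = list(range(n_items))
    rng.shuffle(order)
    return [order[f::n_folds] for f in range(n_folds)]


def nested_cv_plan(n_items, ranks, betas, n_outer=10, n_inner=5, seed=0):
    """Yield (outer, inner, rank, beta, train_idx, val_idx, test_idx) tuples."""
    rng = random.Random(seed)
    for o, test_idx in enumerate(k_fold_indices(n_items, n_outer, rng)):
        held_out = set(test_idx)
        # Development set: everything outside the current outer test fold.
        dev_idx = [i for i in range(n_items) if i not in held_out]
        for j, val_pos in enumerate(k_fold_indices(len(dev_idx), n_inner, rng)):
            val_set = set(val_pos)
            val_idx = [dev_idx[p] for p in val_pos]
            train_idx = [dev_idx[p] for p in range(len(dev_idx)) if p not in val_set]
            # Every hyperparameter pair is evaluated on every inner split.
            for rank, beta in itertools.product(ranks, betas):
                yield o, j, rank, beta, train_idx, val_idx, test_idx


ranks = [25, 50, 75, 100]
betas = [1e2, 1e3, 1e4, 1e5]
plan = list(nested_cv_plan(100, ranks, betas))
```

In a full run, the hyperparameter pair with the best average inner-fold score would be refit on the whole development set and evaluated once on the outer test fold.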
As each input sample is represented by a single feature vector, in order to take
the symmetry of the drug combinations into account, the samples were duplicated
such that both of the drugs in a combination were included in both positions in the
feature vectors. This informs the algorithm that the combination of drug A with
drug B should be considered the same as the combination of drug B with drug A.
The prediction accuracy of all the models was assessed using the same performance
evaluation metrics: RMSE, Pearson correlation, and Spearman correlation.
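The duplication of samples described above can be sketched with one-hot groups for drug A, drug B, their concentrations, and the cell line (the encoding given in the Data representation section). The helper names are hypothetical, and swapping the concentration indices together with the drugs is an assumption about how the duplicated sample is built.

```python
def one_hot(index, size):
    """Return a list of zeros of length `size` with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def encode(drug_a, drug_b, conc_a, conc_b, cell, n_drugs, n_concs, n_cells):
    """Concatenate five one-hot groups: drug A, drug B, conc A, conc B, cell line."""
    return (
        one_hot(drug_a, n_drugs)
        + one_hot(drug_b, n_drugs)
        + one_hot(conc_a, n_concs)
        + one_hot(conc_b, n_concs)
        + one_hot(cell, n_cells)
    )


def symmetric_samples(drug_a, drug_b, conc_a, conc_b, cell, response, **sizes):
    """Return both orderings of a drug pair, each carrying the same response."""
    forward = encode(drug_a, drug_b, conc_a, conc_b, cell, **sizes)
    # Assumption: each concentration moves together with its drug.
    swapped = encode(drug_b, drug_a, conc_b, conc_a, cell, **sizes)
    return [(forward, response), (swapped, response)]


samples = symmetric_samples(0, 2, 1, 0, 1, 42.0, n_drugs=3, n_concs=2, n_cells=2)
```

Both vectors have exactly five non-zero entries, one per group, so the model sees the pair (A, B) and the pair (B, A) as two training examples with an identical label.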
Evaluation of the prediction performance. In this type of application, the predictive
performance is significantly affected by whether the training and test sets
share the different components of the modeled interactions, and it is thus
important to reliably quantify the prediction accuracy under practical application
scenarios. Therefore, we evaluated the predictive performance of comboFM under
three prediction scenarios: (a) new dose–response matrix entry prediction, (b) new
dose–response matrix prediction and (c) new drug combination prediction (cf.
Fig. 1). For each scenario, we used dedicated nested cross-validation setups to
ensure unbiased evaluation. In scenario (a), the predictions were made for individual
held-out entries in dose–response matrices. The held-out entries were
selected at random for each cross-validation fold. In scenario (b), the predictions
were made for completely held-out (dose–response matrix, cell line) pairs, such that
the same drug combination had still been measured in other cell lines. This scenario
corresponds to a widely used strategy in other computational works concerning
drug combination synergy prediction, in which the predictions are made
for new drug–drug–cell line triplets. In scenario (c), the most challenging scenario of
new drug combination prediction, the predictions were made for novel drug
combinations outside the training space with no available combination measurements.
In all prediction scenarios, we assumed that the monotherapy responses of the
single drugs in the combination are known.
To computationally evaluate the prediction performance and optimize the
model parameters, we performed a nested cross-validation procedure. In the first
prediction scenario of new dose–response matrix entry prediction, the cross-validation
folds were formed by simple random sampling from the tensor entries.
In the second prediction scenario concerning new dose–response matrices, the
folds were created by randomly sampling on the level of dose–response matrices,
i.e., if a drug pair–cell line triplet $(x_{d_1}, x_{d_2}, x_c)$ belonged to the test set, the training
tensor did not include any entry involving the triplet $(x_{d_1}, x_{d_2}, x_c)$. In the third
scenario of new drug combination prediction, the random sampling was performed
on the level of drug pairs and all the entries involving the test drug pairs were held
out from the training set, i.e., if a drug pair $(x_{d_1}, x_{d_2})$ belonged to the test set, the
training tensor did not contain any entry involving the pair $(x_{d_1}, x_{d_2})$. Furthermore,
we ensured that the individual drugs in the left-out drug pairs were still observed
individually in other combinations in the training set, which enables the model to
learn from the way the individual drugs in the held-out combinations act in other
combinations.
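The three fold-construction schemes differ only in which entries must stay together in one fold. A minimal sketch, assuming each tensor entry is a tuple (drug 1, drug 2, concentration 1, concentration 2, cell line); the function names and toy tensor are hypothetical, and the additional check that every held-out drug still appears in some training combination is not enforced here.

```python
import random
from collections import defaultdict
from itertools import combinations, product


def entry_key(entry):
    """Scenario (a): every tensor entry is its own group."""
    return entry


def matrix_key(entry):
    """Scenario (b): group by (drug pair, cell line), i.e. one dose-response matrix."""
    return (tuple(sorted(entry[:2])), entry[4])


def pair_key(entry):
    """Scenario (c): group by drug pair, regardless of cell line."""
    return tuple(sorted(entry[:2]))


def grouped_folds(entries, key, n_folds=10, seed=0):
    """Assign entry indices to folds so that entries sharing a key stay together."""
    groups = defaultdict(list)
    for idx, entry in enumerate(entries):
        groups[key(entry)].append(idx)
    group_ids = list(groups)
    random.Random(seed).shuffle(group_ids)
    folds = [[] for _ in range(n_folds)]
    for pos, gid in enumerate(group_ids):
        folds[pos % n_folds].extend(groups[gid])
    return folds


# Toy tensor: 4 drugs (6 pairs), 2 x 2 concentrations, 3 cell lines.
entries = [
    (d1, d2, c1, c2, cell)
    for d1, d2 in combinations(range(4), 2)
    for c1, c2, cell in product(range(2), range(2), range(3))
]
folds_a = grouped_folds(entries, entry_key, n_folds=3)
folds_b = grouped_folds(entries, matrix_key, n_folds=3)
folds_c = grouped_folds(entries, pair_key, n_folds=3)
```

Using one splitting routine with interchangeable keys keeps the three scenarios directly comparable, since only the grouping level changes.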
Drug combination anticancer activity dataset. The drug combination anticancer
activity dataset was obtained from a recent NCI-ALMANAC study4, which is the
largest available drug combination dataset to date. The original dataset covers over
5000 combinations of roughly 100 small molecule drugs screened against 60 cell
lines in various concentrations, containing over 3 million response measurements.
The drugs included in the dataset are FDA-approved oncology drugs with proven
activity and established safety profiles. The cell lines represent human tumor cell
lines from the NCI-60 panel, originating from 9 different tissue types.
To reduce the computational complexity, we selected a subset of the NCI-ALMANAC dataset by randomly sampling 50 drugs (Supplementary Table 3) from
the original set of drugs, ensuring that the distribution of the subset of drug
combination responses matched that of the original one. Furthermore, we
selected drug combinations for which complete measurements across all the 60 cell
lines were available. As a result, we obtained a dataset for our experiments
consisting of 617 drug combinations of 50 unique drugs, screened in 45 unique
concentrations against 60 cell lines, containing 333,180 response measurements for
combinations and 222,120 measurements for monotherapies, measured by
percentage growth of the cell line with respect to a control. Each drug combination
in the dataset had been screened using a 4 × 4 dose–response matrix design.
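The reported measurement counts can be cross-checked with a few lines of arithmetic. The split of each 4 × 4 matrix into 9 combination wells and 6 single-agent wells is an assumption (one row and one column holding the zero dose of the partner drug); it is one reading that reproduces the reported totals, not a statement from the study.

```python
n_combinations = 617
n_cell_lines = 60
matrix_side = 4                                  # 4 x 4 dose-response design

# Assumption: the first row and column hold the wells where only one drug is present.
combo_entries = (matrix_side - 1) ** 2           # 9 wells with both drugs present
mono_entries = 2 * (matrix_side - 1)             # 6 wells with a single drug

n_matrices = n_combinations * n_cell_lines       # dose-response matrices in the subset
n_combo_measurements = n_matrices * combo_entries
n_mono_measurements = n_matrices * mono_entries
```

Under this reading, the subset contains 37,020 dose–response matrices, and the two products match the 333,180 combination and 222,120 monotherapy measurements stated above.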
Data representation. Defining an informative input feature representation of the
underlying data is essential to take full advantage of comboFM and FMs in
general. By defining appropriate input features, FMs have been shown to have the
representation power encompassing a variety of matrix and tensor factorization
models from standard models to more specialized ones21,22. Hence, by learning
FMs, all the subsumed factorization models can also be learned.
In order to represent the structure of the tensor underlying the drug combination
response data as single input feature vectors, one-hot encoding is used. Here, the
input feature vectors x are divided into five different groups corresponding to the
different modes of the tensor: two sets of drugs, their concentrations, and a cell line. In
each group, exactly one value is set to 1 and the rest to 0, with 1 denoting the instance
that is present in the corresponding interaction:

$$x = \Big( \underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{|\text{Drugs}|}, \; \underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{|\text{Drugs}|}, \; \underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{|\text{Concentrations}|}, \; \underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{|\text{Concentrations}|}, \; \underbrace{0, \ldots, 0, 1, 0, \ldots, 0}_{|\text{Cell lines}|} \Big). \tag{8}$$

As the feature vector is non-zero only for the pair of drugs, drug concentrations,
and cell line present in the corresponding interaction, all the other interactions in the
FM model vanish and the model corresponds to standard factorization models
involving categorical variables. However, whereas standard factorization models are
limited to categorical input data only, comboFM and FMs can also incorporate
auxiliary features in addition to the information of the interacting elements, which can
further aid the prediction task, particularly when making predictions outside the
training space. In this work, we used chemical descriptors of molecules and genomic
descriptors of cell lines (see below for details).
Chemical descriptors. As chemical descriptors, we integrated molecular fingerprints,
binary vectors which are designed to represent the structure of a molecule as
a series of bits, each one representing the presence or absence of a particular
substructure. We selected a popular fingerprint of type ‘estate’, consisting of 79 bits
corresponding to the E-State atom types originally defined by Hall and Kier45, obtained
from the rcdk R package46. Fingerprint bits with zero variance across the dataset were
further removed, leaving 34 bits for the two sets of drugs.
Genomic descriptors. As genomic descriptors, we incorporated gene expression
profiles of the cancer cell lines, obtained from the rcellminer R package47. The gene
expression profiles were measured with five different platforms (four Affymetrix arrays
and an Agilent Whole Human Genome Oligo array) and a combined average z-score
was reported as a combined gene expression for a gene. To reduce the dimensionality
of the resulting feature matrix, we selected the 0.5% of genes with the highest variance
across the samples, resulting in 78 gene expression values for each cell line.
Cell lines. Early passage cell lines purchased from ATCC (HS-578T & Malme-3M) and
NCI-Frederick DCTD tumor/cell lines repository (SR & IGR-OV1) were
used for drug combination screening. The cell lines were maintained at 37 °C with
5% CO2 in a humidified incubator in their respective medium (see Supplementary
Table 4a). All the reagents were purchased from ThermoFisher Scientific. All the
cell lines tested negative for mycoplasma. The test was based on the method
described by Choppa et al.48 and was performed as a service by the sample
management laboratory of THL Biobank, Helsinki, Finland.
Drug combination screening. The drug combination testing experimental design
was adopted from Gautam et al.49. Seven different concentrations in log3-fold dilution
of two drugs were combined with each other in 8 × 8 matrix formats. Please refer to
Supplementary Tables 4b and c for the drug information and combination design,
respectively. The compounds were plated to black clear bottom 384-well plates
(Corning #3764) using an Echo 550 Liquid Handler (Labcyte). 100 μM benzethonium
chloride (BzCl2) and 0.1% dimethyl sulfoxide (DMSO) were used as positive and
negative controls, respectively. All subsequent liquid handling was performed using a
MultiFlo FX multi-mode dispenser (BioTek). The pre-dispensed compounds were
dissolved in 5 μl of culture media and left in a plate shaker at room temperature for
30 min. Twenty microliters of cell suspension (please refer to Supplementary Table 4a for
cell line specific seeding densities) was dispensed in the drugged plates. After 72 h
incubation, 25 μl per well of CellTiter-Glo (Promega) reagent was added, and after
10 min of incubation at room temperature, luminescence (cell viability) was measured
using a PheraStar plate reader (BMG Labtech).
Reporting summary. Further information on research design is available in the Nature
Research Reporting Summary linked to this article.
Data availability
The NCI-ALMANAC dataset is publicly available from National Cancer Institute (NCI)
at https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-ALMANAC. The preprocessed
data used in the computational experiments and in-house drug combination testing data
for validating comboFM predictions are available at https://doi.org/10.5281/zenodo.4135059.
Source data underlying the figures and display items are provided at
https://doi.org/10.5281/zenodo.4135059 subdirectory source_data.
Code availability
The code is available at https://doi.org/10.5281/zenodo.4129688.
Received: 16 April 2020; Accepted: 5 November 2020;
References
1. Al-Lazikani, B., Banerji, U. & Workman, P. Combinatorial drug therapy for
cancer in the post-genomic era. Nat. Biotechnol. 30, 679 (2012).
2. Masui, K. et al. A tale of two approaches: complementary mechanisms of
cytotoxic and targeted therapy resistance may inform next-generation cancer
treatments. Carcinogenesis 34, 725–738 (2013).
3. Lehár, J. et al. Synergistic drug combinations tend to improve therapeutically
relevant selectivity. Nat. Biotechnol. 27, 659–666 (2009).
4. Holbeck, S. L. et al. The national cancer institute almanac: a comprehensive
screening resource for the detection of anticancer drug pairs with enhanced
therapeutic activity. Cancer Res. 77, 3564–3576 (2017).
5. O’Neil, J. et al. An unbiased oncology compound screen to identify novel
combination strategies. Mol. Cancer Therapeutics 15, 1155–1162 (2016).
6. Day, D. & Siu, L. L. Approaches to modernize the combination drug
development paradigm. Genome Med. 8, 115 (2016).
7. Bulusu, K. C. et al. Modelling of compound combination effects and
applications to efficacy and toxicity: state-of-the-art, challenges and
perspectives. Drug Discov. Today 21, 225–238 (2016).
8. Ali, S., Tonekaboni, M., Ghoraie, L. S., Satya Kumar Manem, V. & Haibe-Kains, B.
Predictive approaches for drug combination discovery in cancer.
Brief. Bioinforma. 19, 263–276 (2018).
9. Rampášek, L., Hidru, D., Smirnov, P. & Goldenberg, A. Dr.VAE: improving drug
response prediction via modeling of drug perturbation effects. Bioinformatics
35, 3743–3751 (2019).
10. Paltun, B. G., Mamitsuka, H. & Kaski, S. Improving drug response prediction
by integrating multiple data sources: matrix factorization, kernel and network-based
approaches. Brief Bioinform. https://doi.org/10.1093/bib/bbz153 (2019).
11. Cichonska, A. et al. Learning with multiple pairwise kernels for drug
bioactivity prediction. Bioinformatics 34, i509–i518 (2018).
12. Cichonska, A. et al. Computational-experimental approach to drug-target
interaction mapping: a case study on kinase inhibitors. PLoS Computational
Biol. 13, e1005678 (2017).
13. Costello, J. C. et al. A community effort to assess and improve drug sensitivity
prediction algorithms. Nat. Biotechnol. 32, 1202 (2014).
14. Gertrudes, J. C. et al. Machine learning techniques and drug design. Curr.
Medicinal Chem. 19, 4289–4297 (2012).
15. Lavecchia, A. Machine-learning approaches in drug discovery: methods and
applications. Drug Discov. Today 20, 318–331 (2015).
16. Griner, L. A. M. et al. High-throughput combinatorial screening identifies
drugs that cooperate with ibrutinib to kill activated b-cell-like diffuse large
b-cell lymphoma cells. Proc. Natl Acad. Sci. USA 111, 2349–2354 (2014).
17. Sidorov, P., Naulaerts, S., Ariey-Bonnet, J., Pasquier, E. & Ballester, P.
Predicting synergism of cancer drug combinations using nci-almanac data.
Front. Chem. 7, 509 (2019).
18. Bansal, M. et al. A community computational challenge to predict the activity
of pairs of compounds. Nat. Biotechnol. 32, 1213 (2014).
19. Menden, M. P. et al. Community assessment to advance computational
prediction of cancer drug combinations in a pharmacogenomic screen. Nat.
Commun. 10, 2674 (2019).
20. Blondel, M., Fujino, A., Ueda, N. & Ishihata, M. Higher-order factorization
machines. In Advances in Neural Information Processing Systems, 3351–3359
(2016).
21. Rendle, S. Factorization machines. In 2010 IEEE International Conference on
Data Mining, 995–1000 (IEEE, 2010).
22. Rendle, S. Factorization machines with libfm. ACM Trans. Intell. Syst.
Technol. (TIST) 3, 57 (2012).
23. Shoemaker, R. H. The nci60 human tumour cell line anticancer drug screen.
Nat. Rev. Cancer 6, 813–823 (2006).
24. Li, H., Li, T., Quang, D. & Guan, Y. Network propagation predicts drug
synergy in cancers. Cancer Res. 78, 5446–5457 (2018).
25. Jeon, M., Kim, S., Park, S., Lee, H. & Kang, J. In silico drug combination
discovery for personalized cancer therapy. BMC Syst. Biol. 12, 16 (2018).
26. Gayvert, K. M. et al. A computational approach for identifying synergistic
drug combinations. PLoS Comput. Biol. 13, e1005308 (2017).
27. Wildenhain, J. et al. Prediction of synergism from chemical-genetic
interactions by machine learning. Cell Syst. 1, 383–395 (2015).
28. Chen, L. et al. Prediction of effective drug combinations by chemical
interaction, protein interaction and target enrichment of kegg pathways.
BioMed Res. Int. 2013, 723780 (2013).
29. Morris, S. W. et al. Fusion of a kinase gene, alk, to a nucleolar protein gene,
npm, in non-hodgkin’s lymphoma. Science 263, 1281–1284 (1994).
30. Vlot, A. H. C., Aniceto, N., Menden, M. P., Ulrich-Merzenich, G. & Bender, A.
Applying synergy metrics to combination screening data: agreements,
disagreements and pitfalls. Drug Discov. Today 24, 2286–2298 (2019).
31. Palmer, A. C. & Sorger, P. K. Combination cancer therapy can confer benefit
via patient-to-patient variability without drug additivity or synergy. Cell 171,
1678–1691 (2017).
32. Boshuizen, J. & Peeper, D. S. Rational cancer treatment combinations: An
urgent clinical need. Mol. Cell 78, 1002–1018 (2020).
33. Grazia, G., Penna, I., Perotti, V., Anichini, A. & Tassi, E. Towards
combinatorial targeted therapy in melanoma: from pre-clinical evidence to
clinical application. Int. J. Oncol. 45, 929–949 (2014).
34. Suraweera, A., O’Byrne, K. J. & Richard, D. J. Combination therapy with
histone deacetylase inhibitors (hdaci) for the treatment of cancer: achieving
the full therapeutic potential of hdaci. Front. Oncol. 8, 92 (2018).
35. Haas, N. B. et al. Phase ii trial of vorinostat in advanced melanoma. Invest.
New Drugs 32, 526–534 (2014).
36. Ruzzolini, J. et al. Everolimus selectively targets vemurafenib resistant
brafv600e melanoma cells adapted to low ph. Cancer Lett. 408, 43–54 (2017).
37. Pak, E. et al. A large-scale drug screen identifies selective inhibitors of class i
hdacs as a potential therapeutic option for shh medulloblastoma. Neuro.
Oncol. 21, 1150–1163 (2019).
38. Gerner, R. E., Moore, G. E. & Didolkar, M. S. Chemotherapy of disseminated
malignant melanoma with dimethyl triazeno imidazole carboxamide and
dactinomycin. Cancer 32, 756–760 (1973).
39. Rocca, A. et al. A phase i–ii study of the histone deacetylase inhibitor valproic
acid plus chemoimmunotherapy in patients with advanced melanoma. Br. J.
Cancer 100, 28–36 (2009).
40. Tyner, J. W. et al. Functional genomic landscape of acute myeloid leukaemia.
Nature 562, 526–531 (2018).
41. Friedman, A. A., Letai, A., Fisher, D. E. & Flaherty, K. T. Precision medicine
for cancer with next-generation functional diagnostics. Nat. Rev. Cancer 15,
747–756 (2015).
42. Ianevski, A. et al. Prediction of drug combination effects with a minimal set of
experiments. Nat. Mach. Intell. 1, 568–577 (2019).
43. Haibe-Kains, B. et al. Inconsistency in large pharmacogenomic studies. Nature
504, 389–393 (2013).
44. Trofimov, M. & Novikov, A. TFFM: Tensorflow implementation of an
arbitrary order factorization machine. https://github.com/geffy/tffm (2016).
45. Hall, L. H. & Kier, L.B. The molecular connectivity chi indexes and kappa
shape indexes in structure-property modeling. Rev. Comput. Chem. 367–422
(1991).
46. Guha, R. et al. Chemical informatics functionality in r. J. Stat. Softw. 18, 1–16
(2007).
47. Luna, A. et al. rcellminer: exploring molecular profiles and drug response of
the nci-60 cell lines in r. Bioinformatics 32, 1272–1274 (2015).
48. Choppa, P. C., Vojdani, A., Tagle, C., Andrin, R. & Magtoto, L. Multiplex pcr
for the detection of Mycoplasma fermentans, M. hominis and M. penetrans in
cell cultures and blood samples of patients with chronic fatigue syndrome.
Mol. Cell. Probes 12, 301–308 (1998).
49. Gautam, P. et al. Identification of selective cytotoxic and synthetic lethal drug
responses in triple negative breast cancer cells. Mol. Cancer 15, 34 (2016).
Acknowledgements
This work was supported by the Academy of Finland [ICT2023 programme grants
313268 to J.R.; 313266 to T.P. and 313267 to T.A. and grants 292611, 310507, 326238
to T.A.], the Cancer Society of Finland [T.A.], the Sigrid Jusélius Foundation [T.A.],
and Orion Research Foundation sr [P.G.]. The authors thank the FIMM HTB unit and
especially Laura Turunen for their great help with the drug combination assays and
Aleksandr Ianevski for his great help with the synergy scoring and the background
distribution data for Fig. 4. The authors also acknowledge the computational resources
provided by the Aalto Science-IT project as well as CSC - IT Center for Science,
Finland.
Author contributions
H.J., T.A., A.C., S.S., T.P., and J.R. designed the research. H.J., A.C., T.P., J.R., and S.S.
developed computational methods and evaluation protocols. A.C., J.D. and H.J. performed computational evaluations. P.G. designed and performed experimental evaluation. H.J., A.C., P.G., J.R., and T.A. wrote the paper, contributed by T.P. and S.S.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-020-19950-z.
Correspondence and requests for materials should be addressed to T.A. or J.R.
Peer review information Nature Communications thanks Krishna Bulusu and the other,
anonymous, reviewer(s) for their contribution to the peer review of this work. Peer
reviewer reports are available.
Reprints and permission information is available at http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative
Commons license, and indicate if changes were made. The images or other third party
material in this article are included in the article’s Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not included in the
article’s Creative Commons license and your intended use is not permitted by statutory
regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this license, visit http://creativecommons.org/
licenses/by/4.0/.
© The Author(s) 2020
Publication II
Heli Julkunen, Anna Cichońska, P. Eline Slagboom, Peter Würtz, Nightingale Health UK Biobank Initiative. Metabolic biomarker profiling for identification of susceptibility to severe pneumonia and COVID-19 in the general
population. eLife, May 2021.
© 2021 The Author(s). This is an open access article distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use and redistribution provided that the original author
and source are credited.
RESEARCH ARTICLE
Metabolic biomarker profiling for
identification of susceptibility to severe
pneumonia and COVID-19 in the general
population
Heli Julkunen1, Anna Cichońska1, P Eline Slagboom2,3, Peter Würtz1*,
Nightingale Health UK Biobank Initiative1
1Nightingale Health Plc, Helsinki, Finland; 2Molecular Epidemiology, Department of
Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands;
3Max Planck Institute for Biology of Ageing, Cologne, Germany
*For correspondence: peter.wurtz@nightingalehealth.com
Competing interest: See page 17
Abstract Biomarkers of low-grade inflammation have been associated with susceptibility to a
severe infectious disease course, even when measured prior to disease onset. We investigated
whether metabolic biomarkers measured by nuclear magnetic resonance (NMR) spectroscopy could
be associated with susceptibility to severe pneumonia (2507 hospitalised or fatal cases) and severe
COVID-19 (652 hospitalised cases) in 105,146 generally healthy individuals from UK Biobank, with
blood samples collected 2007–2010. The overall signature of metabolic biomarker associations was
similar for the risk of severe pneumonia and severe COVID-19. A multi-biomarker score, comprised
of 25 proteins, fatty acids, amino acids, and lipids, was associated equally strongly with enhanced
susceptibility to severe COVID-19 (odds ratio 2.9 [95%CI 2.1–3.8] for highest vs lowest quintile) and
severe pneumonia events occurring 7–11 years after blood sampling (2.6 [1.7–3.9]). However, the
risk for severe pneumonia occurring during the first 2 years after blood sampling for people with
elevated levels of the multi-biomarker score was over four times higher than for long-term risk (8.0
[4.1–15.6]). If these hypothesis generating findings on increased susceptibility to severe pneumonia
during the first few years after blood sampling extend to severe COVID-19, metabolic biomarker
profiling could potentially complement existing tools for identifying individuals at high risk. These
results provide novel molecular understanding on how metabolic biomarkers reflect the
susceptibility to severe COVID-19 and other infections in the general population.
Funding: See page 17
Received: 11 September 2020
Accepted: 02 May 2021
Published: 04 May 2021
Reviewing editor: Edward D Janus, University of Melbourne, Australia
Copyright Julkunen et al. This article is distributed under the terms of the Creative Commons
Attribution License, which permits unrestricted use and redistribution provided that the
original author and source are credited.
Introduction
The coronavirus disease 2019 (COVID-19) pandemic affects societies and healthcare systems worldwide. Protection of those individuals who are most susceptible to a severe and potentially fatal
COVID-19 disease course is a prime component of national policies, with stricter social distancing
and other preventative means recommended mainly for elderly people and individuals with pre-existing disease conditions. The prominent susceptibility to severe COVID-19 for people at high age
has been linked with impaired immune response due to chronic inflammation caused by ageing processes (Akbar and Gilroy, 2020). However, large numbers of seemingly healthy middle-aged individuals also suffer from severe COVID-19 (Zhou et al., 2020; Atkins et al., 2020; Williamson et al.,
2020); this could partly be due to similar molecular processes related to impaired immunity. A better
understanding of the molecular factors predisposing to severe COVID-19 outcomes may help to
explain the risk elevation ascribed to pre-existing disease conditions. From a translational point of
view, this might also complement the identification of highly susceptible individuals in general population settings beyond current risk factor assessment.
Julkunen et al. eLife 2021;10:e63033. DOI: https://doi.org/10.7554/eLife.63033
Research article
Epidemiology and Global Health Medicine
eLife digest National policies for mitigating the COVID-19 pandemic include stricter measures
for people considered to be at high risk of severe and potentially fatal cases of the disease.
Although older age and pre-existing health conditions are strong risk factors, it is poorly understood
why susceptibility varies so widely in the population.
People with cardiometabolic diseases, such as diabetes and liver diseases, or chronic
inflammation are at higher risk of severe COVID-19 and other infections including pneumonia. These
conditions alter the molecules circulating in the blood, providing potential ‘biomarkers’ to
determine whether a person is more likely to develop a fatal infection. Uncovering these blood
biomarkers could help to identify people who are prone to life-threatening infections despite not
having ever been diagnosed with a cardiometabolic disease.
To find these biomarkers, Julkunen et al. studied blood samples that had been collected from
105,000 healthy individuals in the United Kingdom over ten years ago. The data showed that
individuals with biomarkers linked to low-grade inflammation and cardiometabolic disease were
more likely to have died or been hospitalised with pneumonia.
A score based on 25 of these biomarkers provided the best predictor of severe pneumonia. This
biomarker score performed up to four times better within the first few years after blood sampling
compared to predicting cases of pneumonia a decade later. The same blood biomarker changes
were also linked with developing severe COVID-19 over ten years after the blood samples had been
collected. The predictive value of the biomarker score was similar for both severe COVID-19 and the
long-term risk of severe pneumonia.
Julkunen et al. propose that the metabolic biomarkers reflect inhibited immunity that impairs
response to infections. The results from over 100,000 individuals suggest that these blood
biomarkers may help to identify people at high risk of severe COVID-19 or other infectious diseases.
Pneumonia is a life-threatening complication of COVID-19 and the most common diagnosis in
severe COVID-19 patients. As for COVID-19, the main factors that increase the susceptibility for
severe community-acquired pneumonia are high age and pre-existing respiratory and cardiometabolic diseases, which can weaken the lungs and the immune system (Almirall et al., 2017). Based on
analyses of large blood sample collections of healthy individuals, biomarkers associated with the risk
for severe COVID-19 are largely shared with the biomarkers associated with the risk for severe pneumonia, including elevated markers of impaired kidney function and inflammation and lower HDL cholesterol (Ho et al., 2020). This may indicate that these molecular markers reflect an overall
susceptibility to severe complications after contracting an infectious disease.
Comprehensive profiling of metabolic biomarkers, also known as metabolomics, in prospective
population studies has suggested a range of blood biomarkers for cardiovascular disease and diabetes to also be reflective of the susceptibility for severe infectious diseases (Ritchie et al., 2015;
Deelen et al., 2019). Metabolic profiling could therefore potentially identify biomarkers that reflect
the susceptibility to severe COVID-19 among initially healthy individuals. However, such studies
require measurement of vast numbers of blood samples collected prior to the COVID-19 pandemic. Conveniently, a broad panel of metabolic biomarkers has recently been measured using
nuclear magnetic resonance (NMR) spectroscopy in over 100,000 plasma samples from the UK
Biobank.
Here, we examined if NMR-based metabolic biomarkers from blood samples collected a decade
before the COVID-19 pandemic associate with the risk of severe infectious disease in UK general
population settings. Exploiting the shared risk factor relation between susceptibility to severe
COVID-19 and pneumonia (Ho et al., 2020), we used well-powered statistical analyses of biomarkers
with severe pneumonia events to develop a multi-biomarker score that condenses the information
from the metabolic measures into a single value. Taking advantage of the time-resolved information on the occurrence of severe pneumonia events in the UK Biobank, we mimicked the influence of the decade lag from blood sampling to the COVID-19 pandemic on the biomarker associations, and used analyses with short-term follow-up to interpolate to a scenario of
identifying individuals susceptible to severe COVID-19 in a preventative screening setting. Our
primary aim was to improve the molecular understanding of how metabolic risk markers may contribute to increased predisposition to severe COVID-19 and other infections.
Results
A flow diagram of eligible study participants and case numbers is shown in Figure 1. Clinical characteristics of the study population are listed in Table 1. Among the 105,146 UK Biobank study participants with complete data on metabolic biomarkers and severe pneumonia outcomes, and no prior
history of diagnosed pneumonia, there were 2507 severe pneumonia events recorded in hospital or
death registries after the baseline blood sampling (median follow-up time 8.1 years).
For the severe COVID-19 analyses, there were 652 PCR-confirmed positive cases diagnosed in
hospital (inferred as severe cases in this study) among the 92,725 individuals with COVID-19 data
linkage available as of 3 February 2021. The number of severe COVID-19 cases in the UK Biobank
closely followed the trends in hospitalised individuals for COVID-19 in England (Figure 1—figure
supplement 1). In February 2021, the age range of study participants was 49–84 years. The median
duration from blood sampling to the COVID-19 pandemic was 11.2 years (interquartile range 10.0–
12.6). The prevalence of chronic respiratory and cardiometabolic diseases was similar for study
[Figure 1: flow diagram]
UK Biobank, full cohort: n = 502 639
  Exclude n = 384 190 without metabolic biomarker data
Participants with baseline metabolic biomarker data (random subset of the full cohort): n = 118 462
  Exclude biomarker outliers and samples with missing values in 37 clinically validated biomarkers (n = 10 431)
Individuals with complete biomarker data across clinically validated biomarkers available: n = 108 031
Associations with severe pneumonia:
  Exclude individuals with prevalent pneumonia or pneumonia recorded in primary care settings or by self reports (n = 2 889)
  Participants with pneumonia outcome data: n = 105 142; 2507 severe incident cases (hospitalization or death)
Associations with severe COVID-19:
  Exclude individuals without COVID-19 testing data (assessment centres in Scotland and Wales) and individuals who had died before COVID-19 pandemic (n = 15 306)
  Participants with COVID-19 data available: n = 92 725; 653 severe incident cases (hospitalization)
Figure 1. Flow diagram of study participants and case numbers. Overview of eligible study participants for the analysis of metabolic biomarkers for the
susceptibility to severe pneumonia and COVID-19 in the UK Biobank. Case and control definitions are described in Materials and methods.
The online version of this article includes the following figure supplement(s) for figure 1:
Figure supplement 1. Numbers of COVID-19 positive and hospitalised individuals in the UK Biobank and the whole of England during the course of
the COVID-19 pandemic.
Table 1. Clinical characteristics of the UK Biobank participants in the current study.

                                              Severe pneumonia                  Severe COVID-19
                                              (diagnosis in hospital or         (diagnosis in hospital)
                                              death record)
                                              Incident cases   Controls         Incident cases   Controls
Individuals with NMR biomarker measures       2507             102 639          652              92 073
Age at blood sampling (median, [range])       62 [40-70]       58 [39-70]       60 [40-70]       58 [39-70]
Females (%)                                   44%              54%              43%              54%
Body mass index (mean, kg/m2)                 28.5             27.4             28.7             27.3
Proportion with prevalent diseases
  Cardiovascular disease (%)                  17.5%            6.6%             14.7%            6.4%
  Diabetes (%)                                9.3%             3.9%             9.2%             3.8%
  Lung cancer (%)                             0.4%             0.1%             0.3%             0.1%
  Chronic obstructive pulmonary disease (%)   6.1%             0.7%             1.8%             0.8%
  Liver diseases (%)                          1.5%             0.7%             1.7%             0.7%
  Renal failure (%)                           3.6%             1.3%             2.9%             1.4%
  Dementia (%)                                0.1%             0.01%            0.0%             0.01%
The number of individuals analysed for severe COVID-19 is slightly lower than for severe pneumonia, since COVID-19 data were not available from assessment centres in Scotland and Wales.
participants who developed severe pneumonia and those who contracted COVID-19 and required
hospitalisation, with the exception of COPD. There were 33 overlapping cases between severe pneumonia and COVID-19.
Metabolic biomarkers and severe pneumonia risk
Figure 2A shows the associations of 37 biomarkers with severe pneumonia events occurring during
the follow-up in the entire study population (n = 105 146). The biomarkers highlighted here are
those with a regulatory approval for diagnostics use in the Nightingale Health NMR platform. These
biomarkers span most of the different metabolic pathways captured with the NMR platform; results
for all 249 metabolic measures quantified are shown in Figure 2—figure supplements 1–3. Strong
associations were observed across several metabolic pathways: increased plasma concentrations of
cholesterol measures, omega-3 and omega-6 fatty acid levels, histidine, branched-chain amino acids
and albumin were associated with lower susceptibility to contracting severe pneumonia. Increased
concentrations of monounsaturated and saturated fatty acids, as well as glycoprotein acetyls (GlycA, a
marker of low-grade inflammation) were associated with elevated susceptibility to contracting severe
pneumonia.
Since all the biomarkers are quantified in the same single measurement, we examined if even
stronger associations with severe pneumonia could be obtained using a combination of multiple biomarkers. We derived this multi-biomarker combination, denoted ‘infectious disease score’, using
logistic regression with LASSO for variable selection, considering the 37 clinically validated biomarkers in one half of the study population as the training set. This resulted in an infectious disease
score comprising the weighted sum of 25 biomarkers, with the weights selected by the machine
learning algorithm (Supplementary file 1). Broadly similar results were obtained using all 249 metabolic measures quantified in the Nightingale Health NMR platform to derive the multi-biomarker
score.
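The derivation of a sparse multi-biomarker score can be illustrated with a minimal sketch of L1-penalised (LASSO) logistic regression. Everything in it is an assumption made for illustration: the data are synthetic, the function names (`fit_lasso_logistic`, `linear_score`), the penalty value, the step size and the proximal-gradient solver are hypothetical choices, and the software and penalty selection actually used in the study are not described in this section.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(value, threshold):
    """Proximal step for the L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def column_stats(col):
    """Mean and (population) SD of one biomarker column."""
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return mean, sd


def linear_score(weights, columns, i):
    """Weighted sum of standardised biomarkers for individual i."""
    return sum(w * col[i] for w, col in zip(weights, columns))


def fit_lasso_logistic(columns, outcome, penalty, step=0.5, n_iter=500):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    n = len(outcome)
    weights = [0.0] * len(columns)
    intercept = 0.0
    for _ in range(n_iter):
        residual = [
            sigmoid(intercept + linear_score(weights, columns, i)) - outcome[i]
            for i in range(n)
        ]
        grads = [sum(r * x for r, x in zip(residual, col)) / n for col in columns]
        intercept -= step * sum(residual) / n  # the intercept is not penalised
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, grads)
        ]
    return intercept, weights


# Synthetic data: only the first of three "biomarkers" truly affects the outcome.
rng = random.Random(0)
n_people = 400
raw = [[rng.gauss(0.0, 1.0) for _ in range(n_people)] for _ in range(3)]
outcome = [
    1 if rng.random() < sigmoid(-1.0 + 1.5 * raw[0][i]) else 0
    for i in range(n_people)
]

# Derive the score in one half and keep the other half for validation.
half = n_people // 2
stats = [column_stats(col[:half]) for col in raw]
train_x = [[(v - m) / s for v in col[:half]] for col, (m, s) in zip(raw, stats)]
valid_x = [[(v - m) / s for v in col[half:]] for col, (m, s) in zip(raw, stats)]
intercept, weights = fit_lasso_logistic(train_x, outcome[:half], penalty=0.05)

# The score of each validation individual is the weighted biomarker sum.
valid_scores = [linear_score(weights, valid_x, i) for i in range(n_people - half)]
print("selected biomarkers:", [j for j, w in enumerate(weights) if w != 0.0])
```

Biomarkers whose weight is shrunk exactly to zero drop out of the score, which is how a LASSO penalty can reduce a panel of candidate measures to a smaller weighted sum. Scaling the validation half with the training-half means and SDs keeps the two halves independent.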
The multi-biomarker infectious disease score was then tested for association with severe pneumonia in the other half of the study population. The magnitude of association for the infectious disease
score was approximately twice as strong with severe pneumonia compared to any of the individual
biomarkers (Figure 2B). The odds of contracting severe pneumonia were increased by 67% per 1-SD
increment in the infectious disease score. This corresponds to close to fourfold higher risk for
[Figure 2: forest plot]
A. Odds ratio for severe pneumonia (95% CI), per 1-SD increment in biomarker level
Lipoprotein lipids: Total-C #, VLDL-C, LDL-C #, HDL-C, Triglycerides #
Apolipoproteins: ApoB, ApoA1, ApoB/ApoA1 #
Fatty acids: Total fatty acids, Omega-3 #, Omega-6, PUFA #, MUFA #, SFA, DHA
Fatty acid ratios: Omega-3 %, Omega-6 % #, PUFA % #, MUFA %, SFA % #, DHA % #, PUFA/MUFA, Omega-6/Omega-3 #
Amino acids: Alanine #, Glycine #, Histidine #, Isoleucine #, Leucine #, Valine #, Phenylalanine #, Tyrosine #, Total BCAA
Glycolysis metabolites: Glucose #, Lactate #
Fluid balance: Creatinine #, Albumin #
Inflammation: Glycoprotein acetyls #
B. Odds ratio for severe pneumonia (95% CI), per 1-SD increment in infectious disease score
Infectious disease score
Figure 2. Relation of baseline biomarker concentrations to future risk of severe pneumonia in the UK Biobank (n = 105 146; 2507 incident events). (A)
Odds ratios with severe pneumonia (2507 hospitalisations or deaths during a median of 8 years of follow-up) for 37 clinically validated biomarkers
measured simultaneously in a single assay by Nightingale Health NMR platform. (B) Odds ratio with severe pneumonia for the multi-biomarker
infectious disease score. The infectious disease score comprises the weighted sum of 25 out of 37 clinically validated biomarkers, optimised for
association with severe pneumonia based on one half of the study population using LASSO regression. Biomarkers included in the infectious disease
score are marked by #. The odds ratio for infectious disease score is evaluated in the other half of the study population (n = 52 573; 1250 events). All
models are adjusted for age, sex, and assessment centre. Odds ratios are per 1-SD increment in the biomarker levels. Horizontal bars denote 95%
confidence intervals. Closed circles denote p-value < 0.001 and open circles p-value ≥ 0.001. BCAA indicates branched-chain amino acids; DHA:
docosahexaenoic acid; MUFA: monounsaturated fatty acids; PUFA: polyunsaturated fatty acids; SFA: saturated fatty acids.
The online version of this article includes the following source data and figure supplement(s) for figure 2:
Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 2.
Figure supplement 1. Relation of all biomarkers measured by the Nightingale Health NMR platform to risk of severe pneumonia in UK Biobank
(n = 105 146; 2507 events).
Figure supplement 2. Relation of all biomarkers measured by the Nightingale Health NMR platform to risk of severe pneumonia in UK Biobank
(n = 105 146; 2507 events).
Figure supplement 3. Relation of all biomarkers measured by the Nightingale Health NMR platform to risk of severe pneumonia in UK Biobank
(n = 105 146; 2507 events).
contracting severe pneumonia among people in the highest quintile of the infectious disease score,
compared to those with a score in the lowest quintile.
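As a rough plausibility check, the per-SD odds ratio can be translated into an approximate highest-versus-lowest quintile contrast. The sketch below assumes, purely for illustration, that the score is normally distributed and that the log-odds increase linearly with the score; neither assumption is stated in the text, and the function names are hypothetical. The reported quintile odds ratios come from the fitted models, not from this shortcut.

```python
import math
from statistics import NormalDist


def mean_of_top_fraction(fraction):
    """Mean of a standard normal variable within its top `fraction` tail."""
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1.0 - fraction)
    return std_normal.pdf(cutoff) / fraction


def implied_quintile_odds_ratio(or_per_sd):
    """Approximate top-vs-bottom quintile odds ratio from a per-SD odds ratio."""
    # By symmetry, the bottom-quintile mean is minus the top-quintile mean.
    gap_in_sd = 2.0 * mean_of_top_fraction(0.2)
    return math.exp(gap_in_sd * math.log(or_per_sd))


gap = 2.0 * mean_of_top_fraction(0.2)
implied = implied_quintile_odds_ratio(1.67)  # 67% higher odds per 1-SD
print(f"quintile means differ by {gap:.2f} SD; implied odds ratio {implied:.1f}")
```

Under these assumptions the two extreme quintiles lie about 2.8 SD apart on average, so a per-SD odds ratio of 1.67 implies a contrast of roughly fourfold, in line with the figure quoted in the text.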
To assess the robustness of the multi-biomarker score association with severe pneumonia, we
adjusted the analyses for prevalent diseases and performed analyses stratified by age and sex (Figure 3). The association was attenuated by ~10% in magnitude when adjusting for, or omitting, individuals with a diagnosis of prevalent diseases at time of blood sampling (cardiovascular diseases,
diabetes, lung cancer, COPD, liver diseases, renal failure, and dementia; panels 3A and 3B). The
association was similar across age groups, and also for men and women analysed separately (panels
3C and 3D).
To mimic the influence of the decade-long lag from blood sample collection to the COVID-19
pandemic, we tested the association of the multi-biomarker infectious disease score with severe
pneumonia events occurring during 7–11 years after the blood sampling (Figure 4A). Since there
were only a few severe pneumonia events recorded with more than 9 years of follow-up, we could not
fully mimic the decade-long time lag to the COVID-19 pandemic. The risk elevation observed in this
time-lag accounting scenario was only approximately half of that observed for severe pneumonia
events occurring within the first 7 years (odds ratio 1.43 vs 1.75 per 1-SD, respectively; and 2.59 vs
4.27 for individuals in the highest vs lowest quintile of the infectious disease score).
To interpolate to a screening scenario conducted today, we also tested the association with
short-term risk of severe pneumonia by analysing events occurring within the first 2 years after the
blood sampling (Figure 4B). The association magnitude in this short-term risk scenario
was approximately twice as strong as for severe pneumonia events occurring more than 2 years after
blood sampling (odds ratio 2.21 vs 1.59 per 1-SD; 7.95 vs 3.35 for individuals in the highest vs lowest
quintile of the infectious disease score). The elevated susceptibility to severe pneumonia associated
with the multi-biomarker score was therefore three to four times stronger when examining short-term risk as compared to risk of severe pneumonia events occurring almost a decade after the blood
sampling.
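The text does not state on which scale "three to four times stronger" is measured. One possible reading, sketched here as an assumption rather than as the authors' definition, compares the excess odds (odds ratio minus one) of the two time windows; the ratio of log odds ratios is shown alongside for contrast. The odds ratios are those quoted above; the helper names are hypothetical.

```python
import math

# Odds ratios quoted in the text (per 1-SD; highest vs lowest quintile).
within_2_years = {"per_sd": 2.21, "quintile": 7.95}
years_7_to_11 = {"per_sd": 1.43, "quintile": 2.59}


def excess_odds_ratio(or_a, or_b):
    """How many times larger the excess odds (OR - 1) of a is than that of b."""
    return (or_a - 1.0) / (or_b - 1.0)


def log_odds_ratio(or_a, or_b):
    """How many times larger the log odds ratio of a is than that of b."""
    return math.log(or_a) / math.log(or_b)


excess = {
    key: excess_odds_ratio(within_2_years[key], years_7_to_11[key])
    for key in within_2_years
}
on_log_scale = {
    key: log_odds_ratio(within_2_years[key], years_7_to_11[key])
    for key in within_2_years
}
print({key: round(value, 1) for key, value in excess.items()})
print({key: round(value, 1) for key, value in on_log_scale.items()})
```

On the excess-odds scale the short-term association is about 2.8 times (per 1-SD) to 4.4 times (quintile contrast) that of the 7–11 year window, whereas on the log-odds scale both contrasts are close to 2.2, so the wording of the comparison depends on the scale chosen.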
The elevation in the short-term risk for severe pneumonia for high levels of the infectious disease
multi-biomarker score remained strong when adjusting for BMI, smoking and prevalent diseases
(odds ratio 6.10 for individuals in the highest vs lowest quintile; Figure 4—figure supplement 1).
We further explored the risk gradient for a future onset of severe pneumonia along increasing levels of the infectious disease score, since non-linear effects could potentially facilitate the identification of thresholds for individuals at high susceptibility. Figure 5A shows the increase in the
proportion of individuals who contracted severe pneumonia according to percentiles of the score.
The risk increased prominently in the highest quintile, and particularly for the highest few percentiles. The time-resolved plot of the cumulative probability of severe pneumonia during follow-up is
shown in Figure 5B. The susceptibility to severe pneumonia was particularly elevated among individuals with the very highest levels of the multi-biomarker infectious disease score. This was observed
already during the first few years of follow-up, corroborating the results for long-term and short-term risk shown in Figure 4. The prominent and immediate elevation in susceptibility to severe
[Figure 3: forest plots. In each panel, the left-hand plot shows the odds ratio for severe pneumonia (95% CI) per 1-SD increment in infectious disease score, and the right-hand plot the odds ratio for the highest vs. lowest quintile of infectious disease score.]
A. Additional adjustments: age, sex and assessment center; age, sex, assessment center, BMI and smoking status; age, sex, assessment center, BMI, smoking status and prevalent diseases
B. Individuals with and without prevalent diseases: all individuals; without prevalent diseases
C. Age at blood sampling: 39-53; 53-61; 61-70
D. Men and women separately: men; women
Figure 3. Relation of the multi-biomarker infectious disease score to future risk of severe pneumonia with additional adjustments and in subgroups
(n = 52 573; 1250 incident events). (A) Odds ratios with severe pneumonia after additional adjustments for BMI, smoking status, and prevalent diseases.
(B) Odds ratios with severe pneumonia in study participants with and without prevalent diseases. (C) Odds ratios by age tertiles at the time of blood
sampling. (D) Odds ratios for men and women separately. All models are adjusted for age, sex, and assessment centre. The left-hand side shows odds
ratios per 1-SD increment in the multi-biomarker infectious disease score, and the right-hand side odds ratios for comparing individuals in the highest
and lowest quintiles of the score. The results are based on the validation half of the study population not used in deriving the infectious disease score
(1250 events during a median of 8 years of follow-up).
The online version of this article includes the following source data for figure 3:
Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 3.
pneumonia was also observed when limiting analyses to individuals without chronic respiratory and
cardiometabolic diseases at the time of blood sampling (Figure 5—figure supplement 1).
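The cumulative probability curves referred to above are Kaplan-Meier estimates. A minimal sketch of the estimator follows, using made-up follow-up times for two hypothetical score groups; the function name and data are illustrative assumptions, and individuals who leave follow-up without an event are simply treated as censored.

```python
def kaplan_meier_cumulative_incidence(times, events):
    """Return [(time, 1 - S(time))] at each distinct event time.

    times:  follow-up time of each individual (years)
    events: 1 if follow-up ended with the event, 0 if censored
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        cases = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        leaving = sum(1 for ti in times if ti == t)
        if cases:
            # Product-limit step: multiply by the fraction surviving this time.
            survival *= 1.0 - cases / at_risk
            curve.append((t, 1.0 - survival))
        at_risk -= leaving  # cases and censored individuals leave the risk set
    return curve


# Hypothetical follow-up data (years; 1 = severe pneumonia, 0 = censored).
high_score_times = [0.5, 1.2, 2.0, 3.5, 6.0, 9.5]
high_score_events = [1, 1, 0, 1, 0, 0]
low_score_times = [4.0, 8.0, 9.5, 9.5, 9.5, 9.5]
low_score_events = [1, 0, 0, 0, 0, 0]

high_curve = kaplan_meier_cumulative_incidence(high_score_times, high_score_events)
low_curve = kaplan_meier_cumulative_incidence(low_score_times, low_score_events)
for label, curve in (("high score", high_curve), ("low score", low_curve)):
    print(label, [(t, round(p, 3)) for t, p in curve])
```

Plotting such step curves separately for each percentile range of the score gives a time-resolved view of when the groups start to diverge.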
Metabolic biomarkers and severe COVID-19
Figure 6 shows the associations of the 37 clinically validated biomarkers and the infectious disease
score with the future onset of severe COVID-19 (defined as PCR-confirmed positive inpatient diagnosis). Many of the individual biomarkers had significant associations (p-value<0.001) with increased
risk for severe COVID-19. These biomarkers for susceptibility to severe COVID-19 include lower levels of omega-3 and omega-6 fatty acids as well as albumin, and higher levels of GlycA. We observed a
high concordance in the overall pattern of COVID-19 biomarker associations with severe pneumonia
(Figure 2A), with a Spearman correlation of 0.89 between the overall biomarker association signatures for severe pneumonia and severe COVID-19 (Figure 7).
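The concordance between two association signatures can be quantified with a Spearman rank correlation, computed as the Pearson correlation of the ranks. The sketch below uses five made-up log odds ratios per outcome; these values and the function names are illustrative assumptions, not results from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical log odds ratios of five biomarkers for the two outcomes.
pneumonia_log_or = [-0.20, -0.15, 0.10, 0.25, -0.05]
covid_log_or = [-0.18, -0.10, 0.05, 0.20, 0.12]
rho = spearman(pneumonia_log_or, covid_log_or)
print(f"Spearman correlation of the two signatures: {rho:.2f}")
```

Because only the ordering of the biomarker associations enters the calculation, the coefficient is insensitive to the two outcomes having different overall effect sizes.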
[Figure 4: forest plots. In each panel, the left-hand plot shows the odds ratio for severe pneumonia (95% CI) per 1-SD increment in infectious disease score, and the right-hand plot the odds ratio for the highest vs. lowest quintile of infectious disease score.]
A. Mimicking the decade lag to the COVID-19 pandemic: within 7 years from blood sampling (943 events); 7-11 years from blood sampling (307 events)
B. Mimicking a preventative screening scenario carried out today: within 2 years from blood sampling (162 events); 2-11 years from blood sampling (1088 events)
Figure 4. Relation of the multi-biomarker infectious disease score to long-term and short-term future risk for severe pneumonia (n = 52 573; 1250
incident events). (A) Odds ratios with severe pneumonia events occurring within the first 7 years after the blood sampling, compared to events that
occurred 7–11 years after blood sampling. (B) Odds ratios for severe pneumonia occurring within and after the first 2 years of blood sampling. Models
are adjusted for age, sex, and assessment centre. The left-hand side shows odds ratios per 1-SD increment in the multi-biomarker infectious disease
score, and the right-hand side odds ratios for comparing individuals in the highest and lowest quintiles of the score. The results are based on the
validation half of the study population that was not used in deriving the infectious disease score.
The online version of this article includes the following source data and figure supplement(s) for figure 4:
Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 4.
Figure supplement 1. Relation of the multi-biomarker infectious disease score to long-term and short-term risk for severe pneumonia after adjustment
for risk factors and prevalent diseases (n = 52 573; 1250 events).
[Figure 5: risk gradient plots]
A. Percentage of severe pneumonia cases (0-10%) by infectious disease score percentile (0-100)
B. Cumulative incidence (0-10%) by follow-up time (years), shown for infectious disease score percentile ranges 0-80%, 80-90%, 90-95%, 95-97.5% and 97.5-100%
Figure 5. Risk gradient for contracting severe pneumonia after the blood sampling according to percentiles of the multi-biomarker infectious disease
score (n = 52 573; 1250 incident events). (A) Proportion of individuals who contracted severe pneumonia during a median follow-up time of 8.1 years
after the blood sampling according to percentiles of the multi-biomarker infectious disease score. Each point represents approximately 500 individuals.
(B) Kaplan-Meier curves of the cumulative probability for severe pneumonia in quantiles of the multi-biomarker infectious disease score. The follow-up
time was truncated at 9.5 years since only a small fraction of individuals were followed longer. Results are based on the validation half of the study
population that was not used in deriving the infectious disease score (n = 52,573). The corresponding plots for individuals free of baseline respiratory
and cardiometabolic diseases are shown in Figure 5—figure supplement 1.
The online version of this article includes the following source data and figure supplement(s) for figure 5:
Source data 1. Numerical tabulation of event rates for each percentile in Figure 5A.
Figure supplement 1. Risk gradients for contracting severe pneumonia by percentiles of the multi-biomarker infectious disease score among
individuals without prevalent diseases at time of blood sampling (n = 46,252; 877 events).
The multi-biomarker infectious disease score derived for the future onset of severe pneumonia
was also robustly associated with the future onset of severe COVID-19. The odds ratio was 1.40 per
1-SD increment and 2.90 for comparing individuals in the highest quintile of the multi-biomarker
infectious disease score to those in the lowest quintile. This magnitude of association with susceptibility to severe COVID-19 was similar to that observed with severe pneumonia events occurring during the interval of 7–11 years after the blood sampling.
We further examined the association of the multi-biomarker infectious disease score with severe
COVID-19 after adjustment or exclusion for prevalent diseases, and conducted stratified analyses for
age and sex (Figure 8). The association with severe COVID-19 was attenuated, but remained significant when adjusted for BMI, smoking and prevalent diseases (panel 8A). The association magnitudes
were approximately 20% weaker when limiting the COVID-19 analyses to individuals without prevalent diseases at time of blood sampling (panel 8B). There was no robust evidence of differences in
association magnitude according to age (panel 8C) and odds ratios were broadly similar for men and
women (panel 8D).
Finally, we examined the technical repeatability and biological stability of measuring the multi-biomarker infectious disease score. The measurement repeatability was high (Pearson correlation
0.94 in blind duplicate samples; Figure 9A). Even though the blood samples were primarily non-fasting, the levels of the infectious disease score remained broadly stable during 4 years based on blood
samples from repeat visits (Pearson correlation 0.61 between baseline and repeat visit measurements; Figure 9B).
[Figure 6: forest plot]
A. Odds ratio for severe COVID-19 (95% CI), per 1-SD increment in biomarker level
Lipoprotein lipids: Total-C #, VLDL-C, LDL-C #, HDL-C, Triglycerides #
Apolipoproteins: ApoB, ApoA1, ApoB/ApoA1 #
Fatty acids: Total fatty acids, Omega-3 #, Omega-6, PUFA #, MUFA #, SFA, DHA
Fatty acid ratios: Omega-3 %, Omega-6 % #, PUFA % #, MUFA %, SFA % #, DHA % #, PUFA/MUFA, Omega-6/Omega-3 #
Amino acids: Alanine #, Glycine #, Histidine #, Isoleucine #, Leucine #, Valine #, Phenylalanine #, Tyrosine #, Total BCAA
Glycolysis metabolites: Glucose #, Lactate #
Fluid balance: Creatinine #, Albumin #
Inflammation: Glycoprotein acetyls #
B. Odds ratio for severe COVID-19 (95% CI), per 1-SD increment in infectious disease score
Infectious disease score
Figure 6. Relation of baseline biomarkers and multi-biomarker infectious disease score to future risk of severe COVID-19 (n = 92 725; 652 cases
diagnosed in hospital). (A) Odds ratios with severe COVID-19 (defined as PCR-positive diagnosis in hospital; 652 cases out of 92 725 individuals) for 37
clinically validated biomarkers measured by NMR. (B) Odds ratio with severe COVID-19 for the multi-biomarker infectious disease score. Biomarkers
included in the infectious disease score are marked by #. Odds ratios are per 1-SD increment in the biomarker levels. Models are adjusted for age, sex,
and assessment centre.
The online version of this article includes the following source data for figure 6:
Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 6.
Discussion
Most biomarker studies on COVID-19 have focused on characterising already infected patients and
their disease prognosis (Kermali et al., 2020; Shen et al., 2020; Messner et al., 2020;
Dierckx et al., 2020). In contrast, in the largest blood metabolic profiling study to date, we explored
biomarker associations for susceptibility to severe pneumonia and COVID-19 in general population
settings. We developed a multi-biomarker score for increased susceptibility to a severe infectious
disease course, and demonstrated that this biomarker score captures an increased risk for COVID-19
hospitalisation a decade after the blood sampling.
The overall signature of biomarker associations was similar for the susceptibility to severe COVID-19 and to severe pneumonia (Figure 7). The proportions of individuals with existing cardiometabolic
diseases were also consistent for both of these infectious diseases (Table 1). We used these observations of a shared risk factor basis to draw an analogy between susceptibility to severe pneumonia
and severe COVID-19, and hereby infer potential implications for preventative screening. We therefore exploited the strong statistical power and time-resolved information on severe pneumonia
events for more detailed analyses than was feasible with COVID-19. This led to three important
observations. First, the infectious disease multi-biomarker score was largely independent of prevalent chronic respiratory and cardiometabolic diseases (Figure 3). Second, the susceptibility to severe
pneumonia was drastically elevated in the extreme tail of the multi-biomarker infectious disease
score, with 5–10 times higher risk compared to individuals with normal levels of the multi-biomarker
score (Figure 5). Such features might aid in establishing thresholds for identifying individuals most
susceptible to a severe disease course. Third, the odds ratio of the multi-biomarker score for severe
pneumonia events occurring after 7–11 years closely matched that of severe COVID-19, for which all
events occurred over a decade after blood sampling (Figure 4A). Yet, screening for the susceptibility
to severe COVID-19 would require a strong association with the short-term risk. When confining the
analyses of severe pneumonia to events occurring within the first 2 years after blood sampling, the
short-term risk elevation was over four times stronger than that observed for long-term risk — individuals with high levels of the multi-biomarker score were almost 7-times more susceptible than people with low levels (Figure 4B). If similar enhancement in short-term risk extend to COVID-19, our
results could potentially indicate applications for identification of individuals at high susceptibility to
a severe COVID-19 disease course. However, the unavailability of metabolic biomarker data from
blood samples drawn shortly prior to the pandemic prevents us from examining biomarker associations with short-term COVID-19 susceptibility, and our results should therefore be considered
hypothesis-generating.
We observed multiple blood biomarkers commonly linked with the risk for cardiovascular disease
and diabetes (Soininen et al., 2015; Würtz et al., 2017; Holmes et al., 2018; Ahola-Olli et al.,
2019) to also be associated with increased susceptibility to both severe pneumonia and severe
COVID-19. The biomarkers span multiple metabolic pathways, including low concentrations of lipoprotein lipids, impaired fatty acid balance, decreased amino acid levels and high chronic inflammation. This is the first study to show that many of these blood biomarkers associate with susceptibility
to severe infections, potentially indicating that fatty acids and amino acids should not be considered
only as biomarkers for cardiometabolic risk. The associations of omega-3 and other fatty acids with
the risk for severe COVID-19 may be particularly important, as these measures are more directly
modifiable by lifestyle means than common markers of inflammation. The overall pattern of biomarker associations followed a characteristic metabolic signature reflective of an increased susceptibility to a severe infectious disease. This pattern of biomarker associations is broadly similar to what
has previously been reported with the risk for all-cause mortality in smaller prospective cohort studies (Deelen et al., 2019). It is therefore unlikely that the identified biomarker signature is specific to
the risk for severe pneumonia and COVID-19, or even specific to infectious diseases in general. We
Julkunen et al. eLife 2021;10:e63033. DOI: https://doi.org/10.7554/eLife.63033
Research article
Epidemiology and Global Health Medicine
[Figure 7 plot: scatter plot of the odds ratio for severe pneumonia (95% CI), per 1-SD increment in biomarker level (y-axis), against the odds ratio for severe COVID-19 (95% CI), per 1-SD increment in biomarker level (x-axis), with one labelled point per biomarker. Spearman correlation: 0.89.]
Figure 7. Concordance of the overall pattern of biomarker associations with future onset of severe pneumonia and severe COVID-19. Biomarker
associations with future onset of severe pneumonia (y-axis) plotted against the corresponding associations with severe COVID-19 (x-axis). The odds
ratios, with adjustment for age, sex, and assessment centre, for each of the 37 clinically validated biomarkers in the Nightingale Health NMR platform
are given with 95% confidence intervals in vertical and horizontal error bars. The dashed line denotes the diagonal.
The online version of this article includes the following source data for figure 7:
Source data 1. Numerical tabulation of odds ratios, and 95% confidence intervals for results shown in Figure 7.
[Figure 8 plot: forest plots in four panels. (A) Additional adjustments: age, sex and assessment centre; age, sex, assessment centre, BMI and smoking status; age, sex, assessment centre, BMI, smoking status and prevalent diseases. (B) Individuals with and without prevalent diseases: all individuals; without prevalent diseases. (C) Age at the time of COVID-19 pandemic: 49–63; 63–72; 72–84. (D) Men and women separately. Left x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in infectious disease score. Right x-axis: odds ratio for severe COVID-19 (95% CI), highest vs. lowest quintile of infectious disease score.]
Figure 8. Relation of the multi-biomarker infectious disease score to future risk of severe COVID-19 with additional adjustments and in subgroups of
the study population (n = 92,725; 652 cases diagnosed in hospital). (A) Odds ratios with severe COVID-19 after additional adjustments for BMI, smoking
status and prevalent diseases. (B) Odds ratios with severe COVID-19 in study participants with and without prevalent diseases at the time of blood
sampling. (C) Odds ratios by age tertiles at the time of the COVID-19 pandemic. (D) Odds ratios for men and women, separately. The left-hand side
shows the odds ratios per 1-SD increment in the multi-biomarker infectious disease score, and the right-hand side the odds ratios for comparing
individuals in the highest and lowest quintiles of the score. All models are adjusted for age, sex, and assessment centre.
The online version of this article includes the following source data for figure 8:
Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 8.
propose that the overall metabolic biomarker perturbations observed here reflect molecular signals
of low-grade inflammation that exacerbate disease severity, in case of both infectious and chronic
diseases (Akbar and Gilroy, 2020; Bonafè et al., 2020). In line with this, prior studies have demonstrated that elevated levels of GlycA, the biomarker with the strongest weight in the infectious disease score, are associated with increased neutrophil activity and the long-term risk for fatal infections
(Ritchie et al., 2015). Such over-activity of immune response from pneumonia or COVID-19 infection
is known to cause tissue damage and organ dysfunction through cytokine storm, a common complication of severe COVID-19 (Mangalmurti and Hunter, 2020). While the specific biological mechanisms underpinning the blood metabolic biomarker associations with chronic and infectious diseases
remain poorly understood, we emphasize that the observational character of our study does not
allow us to conclude whether the biomarkers are contributing causally to increase the risk or are
merely indirect risk markers.
Replication of novel biomarker associations is a key aspect in observational studies. We are not
aware of other prospective studies with sufficient COVID-19 hospitalisation events and NMR-based
metabolic biomarker data to address this. However, a preprint of the present study featured analysis
of 195 severe COVID-19 cases, based on data available in UK Biobank back in June 2020
(Julkunen et al., 2020). In the present updated analyses, with over three times the number of cases,
all biomarker associations with susceptibility to severe COVID-19 were similar or stronger, and
Figure 9. Technical repeatability for measuring the multi-biomarker infectious disease score and biological stability in repeat measures 4 years after the
baseline blood sampling. (A) Technical repeatability of the infectious disease score assessed in blind duplicate samples. The correlation plot is based
on 2863 blind duplicate plasma samples measured along with the regular measurements of ~105,000 samples in the Nightingale Health-UK Biobank
initiative. (B) Biological stability of the infectious disease score based on plasma samples from 1298 individuals who attended both the baseline visit as
well as a repeat visit ~4 years later at the UK Biobank assessment centres.
thereby provide a within-cohort replication of our initial findings. In addition, a recent study used the
same metabolic biomarker panel in three cohorts of hospitalised patients and observed similar overall biomarker perturbations to be predictive of COVID-19 severity (Dierckx et al., 2020). The study
also reported the multi-biomarker infectious disease score to be among the strongest biomarkers
for discriminating COVID-19 severity among already hospitalised patients.
Our study has both strengths and limitations. Strengths include the large sample size, which
enabled the analysis of biomarkers for susceptibility to severe COVID-19 based on pre-pandemic
blood samples from general population settings. We used a validated metabolic profiling platform
that enables simultaneous quantification of numerous metabolic biomarkers in a scalable low-cost
setup. Although the number of hospitalised COVID-19 cases was in line with the prevalence in England, we acknowledge that the statistical power was limited for prediction analyses even with close
to 100,000 samples linked with COVID-19 outcome data. Furthermore, the UK Biobank study participants are not fully representative of the UK population by demographic characteristics; the individuals were enrolled on a volunteer basis and are therefore more representative of healthier individuals
than average (Sudlow et al., 2015; Fry et al., 2017). Even though this is generally not a concern for
investigating risk associations (Keyes and Westreich, 2019), it does limit the statistical power to
explore effects of ethnicity and old age. Other limitations include the decade-long duration from
blood sampling to the COVID-19 pandemic. While this limits inference on how well the biomarkers
predict short-term risk for severe COVID-19, our analogy with long-term risk for severe pneumonia
indicates that the time lag likely attenuates the biomarker association magnitudes substantially. Conversely, the remarkably strong associations for short-term risk of severe pneumonia led us to speculate that similar enhancements in association magnitudes could also hold for severe COVID-19.
However, this inference should be further tested, in particular in the light of the bacterial origin of
many severe pneumonia cases and the viral origin of COVID-19. Weaker biomarker associations for
severe COVID-19 compared to severe pneumonia may also arise from the UK Biobank COVID-19
data being influenced by ascertainment bias in terms of differential healthcare seeking and differential testing (Griffith et al., 2020), whereas pneumonia is anticipated to have nearly complete case
ascertainment (Ho et al., 2020).
In conclusion, a metabolic signature of perturbed blood biomarkers is associated with an
increased susceptibility to both severe pneumonia and COVID-19 in blood samples collected a
decade before the pandemic. The multi-biomarker score captures an elevated susceptibility to
severe pneumonia within a few years after blood sampling that is several times stronger than the risk
elevation associated with many pre-existing health conditions, such as obesity and diabetes
(Ho et al., 2020). If the three- to fourfold elevation in short-term risk compared to long-term risk of
severe pneumonia also applies to severe COVID-19, then the metabolic biomarker profiling could
potentially complement existing tools for identifying individuals most susceptible to a severe
COVID-19 disease course. Regardless of the translational prospects, these results provide novel
understanding of how metabolic biomarkers may reflect the susceptibility to severe COVID-19 and
other infections.
Materials and methods
Study population
Details of the design of the UK Biobank have been reported previously (Sudlow et al., 2015). Briefly,
the UK Biobank recruited 502,639 participants aged 37–70 years in 22 assessment centres across the
UK. All study participants had to be able to attend the assessment centres by their own means, and
there was no enrolment at nursing homes. All participants provided written informed consent and
ethical approval was obtained from the North West Multi-Center Research Ethics Committee. Blood
samples were drawn at baseline between 2007 and 2010. The current analysis was approved under
UK Biobank Project 30418. No selection criteria were applied to the sampling.
Metabolic biomarker profiling
From the entire UK Biobank population, a random subset of non-fasting baseline plasma samples
(aliquot 3) from 118 466 individuals and 1298 repeat-visit samples were measured using high-throughput NMR spectroscopy (Nightingale Health Plc; biomarker quantification version 2020). This
provides simultaneous quantification of 249 metabolic biomarker measures in a single assay, including routine lipids, lipoprotein subclass profiling with lipid concentrations within 14 subclasses, fatty
acid composition, and various low-molecular-weight metabolites such as amino acids, ketone bodies,
and glycolysis metabolites quantified in molar concentration units. Technical details and epidemiological applications of the metabolic biomarker data have been reviewed (Soininen et al., 2015;
Würtz et al., 2017). The Nightingale NMR platform has received various regulatory approvals,
including CE-mark, and 37 biomarkers in the panel have been certified for diagnostics use. We
focused on this particular set of certified biomarkers, as we wanted to investigate if these markers of
systemic metabolism — commonly linked to cardiometabolic diseases — could also be associated
with future risk for severe infectious disease. Furthermore, these clinically validated biomarkers span
most of the different metabolic pathways measured by the NMR platform and could facilitate potential translational applications as they are certified for diagnostics use and are measured simultaneously in a single assay. The mean and standard deviation of concentrations for 249 quantified
metabolic biomarkers are given in Supplementary file 2.
Measurements of the metabolic biomarkers were conducted blinded prior to the linkage to the
UK Biobank health outcomes. The metabolic biomarker data were curated and linked to UK Biobank
clinical data in late May 2020. The metabolic biomarker dataset was made available to the
research community through the UK Biobank in March 2021.
Severe pneumonia outcomes
We combined ICD-10 codes J12–J18 to define the pneumonia endpoint. To strengthen the analogy
with the analysis of severe COVID-19, we focused on severe pneumonia events, defined as diagnosis
in hospital or death records based on UK Hospital Episode Statistics data and national death registries (2507 incident cases in the current study). All analyses are based on the first occurrence of a
diagnosis. Therefore, 2658 individuals with recorded hospitalisation of pneumonia prior to the blood
sampling were excluded. Additionally, 346 individuals with pneumonia diagnosis recorded in primary
care settings and by self-reports were also omitted from the analyses. The registry-based follow-up
was from blood sampling in 2007–2010 through to 2016–2017, depending on assessment centre
(850,000 person-years).
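The endpoint definition above can be sketched in a few lines of code. This is a minimal illustration and not the authors' pipeline: the record layout (a list of `(event_date, icd10_code)` tuples per participant), the function names and the example dates are assumptions made for this sketch. Only the J12–J18 code range, the use of the first recorded diagnosis and the exclusion of participants with pneumonia before blood sampling come from the text. The additional exclusion of primary care and self-reported diagnoses is not modelled here.

```python
from datetime import date


def is_pneumonia_code(icd10_code):
    """Return True if an ICD-10 code falls in the J12-J18 block."""
    code = icd10_code.strip().upper().replace(".", "")
    if len(code) < 3 or code[0] != "J" or not code[1:3].isdigit():
        return False
    return 12 <= int(code[1:3]) <= 18


def classify_participant(records, sampling_date):
    """Classify one participant from hospital or death records.

    records: iterable of (event_date, icd10_code) tuples (assumed layout).
    Returns "excluded" if the first pneumonia event precedes blood sampling,
    "case" if it occurs on or after sampling (same-day handling is an
    assumption of this sketch), and "control" if there is no such event.
    """
    pneumonia_dates = sorted(d for d, code in records if is_pneumonia_code(code))
    if not pneumonia_dates:
        return "control"
    first_event = pneumonia_dates[0]
    return "excluded" if first_event < sampling_date else "case"


# Hypothetical participant sampled in 2008 with a hospital diagnosis in 2012.
status = classify_participant([(date(2012, 3, 4), "J18.1")], date(2008, 5, 1))
```

Basing the classification on the earliest matching record mirrors the statement that all analyses use the first occurrence of a diagnosis.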
Severe COVID-19 outcomes
We used COVID-19 data available in the UK Biobank as of 3 February 2021, which cover test
results from 16 March 2020 to 1 February 2021. These data include information on positive/negative
PCR-based diagnosis results and explicit evidence in the microbiological record on whether the participant was an inpatient (Resource UKBiobankD, 2020). For the present analyses, we focused on
PCR-positive inpatient diagnoses. These hospitalised cases are here denoted as severe COVID-19
(652 cases in the current study). COVID-19 data were not available for assessment centres in Scotland and Wales, so individuals from these centres were excluded. Individuals who had died during
follow-up prior to 2018 were also excluded, since they were never exposed to COVID-19.
Control group
The entire study population of non-cases was used as controls in the statistical analyses (n = 102,639
for severe pneumonia and n = 92,073 for severe COVID-19, respectively). This choice of controls is
consistent with the majority of publications examining risk factors for susceptibility to severe COVID-19 (e.g. Ho et al., 2020; Williamson et al., 2020). It allows us to address the question of whether an
initially healthy person with a high value of a given biomarker is at an increased risk of eventually
getting the disease outcome (severe pneumonia or COVID-19 hospitalisation) compared to people
from the general population with low levels of the biomarker. This choice of controls also overcomes
biases that may arise from analyses using confirmed mild infections as the control group, such as collider bias caused by non-random testing of the control group compared to the rest of the study population (Griffith et al., 2020).
Prevalent diseases
To examine the influence of prevalent diseases in the prospective analyses of severe pneumonia and
severe COVID-19, we used the following: prevalent cardiovascular disease (ICD-10 codes I20–I25,
I50, I60–I64, and G45), diabetes (E10–E14), lung cancer (C33–C34, D02.2, Z85.1), chronic obstructive
pulmonary disease (COPD; J43–J44), liver diseases (K70–K77), renal failure (N17–N19), and dementia
(F00–F03).
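The code lists above translate directly into a lookup table. The following sketch shows one way to map a participant's recorded ICD-10 codes to the prevalent disease groups. The group labels, function names and the dot-free code normalisation are assumptions of this illustration. The code ranges themselves are transcribed from the text.

```python
# Three-character ICD-10 category ranges (inclusive) per disease group,
# transcribed from the text above.
CATEGORY_RANGES = {
    "cardiovascular disease": [("I20", "I25"), ("I50", "I50"),
                               ("I60", "I64"), ("G45", "G45")],
    "diabetes": [("E10", "E14")],
    "lung cancer": [("C33", "C34")],
    "COPD": [("J43", "J44")],
    "liver disease": [("K70", "K77")],
    "renal failure": [("N17", "N19")],
    "dementia": [("F00", "F03")],
}

# Four-character subcategory codes that the text lists individually.
SUBCATEGORY_CODES = {"D022": "lung cancer", "Z851": "lung cancer"}


def normalise(icd10_code):
    """Upper-case the code and drop the dot, e.g. 'd02.2' -> 'D022'."""
    return icd10_code.strip().upper().replace(".", "")


def disease_group(icd10_code):
    """Return the prevalent disease group for a code, or None if unlisted."""
    code = normalise(icd10_code)
    for prefix, group in SUBCATEGORY_CODES.items():
        if code.startswith(prefix):
            return group
    category = code[:3]
    for group, ranges in CATEGORY_RANGES.items():
        for low, high in ranges:
            # Same-letter codes with two digits compare correctly as strings.
            if low <= category <= high:
                return group
    return None


def prevalent_diseases(codes):
    """Set of disease groups present among a participant's codes."""
    return {group for group in map(disease_group, codes) if group is not None}


# Hypothetical participant with heart failure, COPD and an unrelated code.
groups = prevalent_diseases(["I50.0", "J44.9", "Z00.0"])
```

Keeping the ranges in a single table makes it easy to check the sketch against the definitions in the text.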
Statistical methods
Biomarker levels more than four interquartile ranges from the median were considered outliers and
excluded. All 37 biomarkers were scaled to standard deviation (SD) units prior to analyses. For biomarker association testing with severe pneumonia and with severe COVID-19 (as separate outcomes), we used logistic regression models adjusted for age, sex, and assessment centre. To
examine the utility of multiple biomarkers in combination, we used a weighted sum of the biomarkers optimised for association with future risk of severe pneumonia; this multi-biomarker score
was denoted as ‘infectious disease score’. To minimise the collinearity of the biomarkers, the multi-biomarker score was trained using logistic regression with least absolute shrinkage and selection
operator (LASSO), which uses L1 regularisation that adds a penalty proportional to the sum of the absolute values of the
coefficients. The multi-biomarker infectious disease score was trained using half of
the study population with complete data available for the 37 clinically validated biomarkers
(n = 52,573 and 1257 severe pneumonia events) using five-fold cross-validation to optimise the regularisation parameter λ. The remaining half of the study population was used in validating the performance of the biomarker score in relation to future risk for severe pneumonia. The multi-biomarker
infectious disease score was subsequently tested for association with severe pneumonia and COVID-19 in logistic regression models adjusted for age, sex, and assessment centre. We further examined
the effect of additional adjustment for body mass index (BMI) and smoking status (never, former,
current) and prevalent diseases. The associations were also examined by omitting individuals with
prevalent diseases and stratified by age and sex. In the case of severe pneumonia, we further examined the association magnitudes according to follow-up time: we used severe pneumonia events
occurring during 7–11 years after the blood sampling to mimic the decade-long lag from blood sampling to the COVID-19 pandemic, and severe pneumonia events occurring within the first 2 years to
interpolate to the scenario of preventative COVID-19 screening carried out today. In both scenarios,
the confined follow-up times were arbitrarily chosen to be as short as possible while ensuring sufficient numbers of events. Finally, to explore potential non-linear effects, the infectious disease score
was plotted as a proportion of individuals who contracted severe pneumonia during follow-up when
binning individuals into percentiles of the infectious disease score (Khera et al., 2018). The time-resolution was further examined by Kaplan-Meier curves of the cumulative risk for severe pneumonia.
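To make the LASSO step concrete, the sketch below fits an L1-penalised logistic regression by proximal gradient descent in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the text does not name the software or solver used, the optimiser here is chosen only for brevity, and the function names, step size, iteration count and toy data are all made up for this example. The regularisation parameter `lam` is passed in directly, whereas in the study it was chosen by five-fold cross-validation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a value towards zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(rows, labels, lam, step=0.1, n_iter=500):
    """Minimise mean logistic loss + lam * sum(|beta_j|).

    rows: list of rows of SD-scaled biomarker values; labels: 0/1 outcomes.
    The intercept is not penalised. Returns (intercept, beta).
    """
    n, p = len(rows), len(rows[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for row, label in zip(rows, labels):
            z = intercept + sum(b * v for b, v in zip(beta, row))
            residual = sigmoid(z) - label
            grad0 += residual
            for j, v in enumerate(row):
                grad[j] += residual * v
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        intercept -= step * grad0 / n
        beta = [soft_threshold(b - step * g / n, step * lam)
                for b, g in zip(beta, grad)]
    return intercept, beta


# Toy data, made up for illustration: feature 0 is associated with the
# outcome, feature 1 is not.
rows = [[-1.5, 1.0], [-1.5, -1.0], [-0.5, 1.0], [-0.5, -1.0],
        [0.5, 1.0], [0.5, -1.0], [1.5, 1.0], [1.5, -1.0]]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

_, weak_penalty = fit_l1_logistic(rows, labels, lam=0.01)
# With a strong penalty both coefficients are shrunk to exactly zero.
_, strong_penalty = fit_l1_logistic(rows, labels, lam=5.0)
```

The soft-thresholding step is what sets some coefficients to exactly zero, which is how LASSO selects a subset of correlated biomarkers rather than spreading weight across all of them.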
Acknowledgements
The authors are grateful to UK Biobank for access to data to undertake this study (Project #30418).
Additional information
Competing interests
Heli Julkunen: HJ is an employee of Nightingale Health Plc and holds stock options with the company. Anna Cichońska:
AC is an employee of Nightingale Health Plc and holds stock options with the company. Peter Würtz: PW is an employee
and shareholder of Nightingale Health Plc. The other author declares that no competing interests
exist.
Funding
Funder: Nightingale Health Plc
Authors: Heli Julkunen, Anna Cichońska, Peter Würtz
This work, including data collection, statistical analysis and writing of the
paper, was done by employees of Nightingale Health Plc.
Author contributions
Heli Julkunen, Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing; Anna Cichońska, Conceptualization, Data curation, Software, Formal analysis, Supervision, Investigation, Visualization,
Methodology, Writing - original draft, Project administration; P Eline Slagboom, Investigation, Methodology, Writing - original draft, Writing - review and editing; Peter Würtz, Conceptualization,
Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft,
Project administration, Writing - review and editing
Author ORCIDs
Heli Julkunen https://orcid.org/0000-0002-4282-0248
Peter Würtz https://orcid.org/0000-0002-5832-0221
Ethics
Human subjects: The UK Biobank recruited 502 639 participants aged 37-70 years in 22 assessment
centres across the UK. All participants provided written informed consent and ethical approval was
obtained from the North West Multi-Center Research Ethics Committee. Details of the design of the
UK Biobank have been reported previously (Sudlow et al PLOS Medicine 2015). The current analysis
was approved under UK Biobank Project 30418.
Decision letter and Author response
Decision letter https://doi.org/10.7554/eLife.63033.sa1
Author response https://doi.org/10.7554/eLife.63033.sa2
Additional files
Supplementary files
• Supplementary file 1. Table of weights of the biomarkers included in the multi-biomarker infectious
disease score derived using LASSO regression. The table indicates the weights for the 25 biomarkers
that were selected in derivation of the multi-biomarker infectious disease score, based on optimising
prediction for severe pneumonia using logistic regression with LASSO in the derivation half of the
study population (n = 52,573). Each biomarker was scaled to SD-units prior to the analyses. The
infectious disease score was then calculated as $\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{25} X_{25}$, with $X_i$ denoting the
SD-standardised biomarker level for the $i$th biomarker and $\beta_i$ denoting the coefficient from the multi-biomarker logistic regression model. DHA indicates docosahexaenoic acid; MUFA: monounsaturated
fatty acids; PUFA: polyunsaturated fatty acids; SFA: saturated fatty acids.
• Supplementary file 2. Mean biomarker concentrations and standard deviations, and odds ratios of
all 249 biomarkers with severe pneumonia. The table indicates mean concentrations and standard
deviations used for biomarker scaling. The table also includes numerical results of odds ratios of all
249 biomarkers with severe pneumonia, with corresponding 95% confidence intervals and p-values,
and whether each biomarker is clinically validated and included in the multi-biomarker infectious disease score.
• Transparent reporting form
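The score calculation described for Supplementary file 1 is a weighted sum of SD-scaled biomarker levels, which can be sketched as follows. The biomarker names and weights in the example are hypothetical placeholders and are not the values from Supplementary file 1. Centring on the mean before dividing by the sample standard deviation is also an assumption of this sketch, since the text only states that biomarkers were scaled to SD units.

```python
from statistics import mean, stdev


def sd_scale(values):
    """Centre on the mean and divide by the sample standard deviation."""
    centre, spread = mean(values), stdev(values)
    return [(value - centre) / spread for value in values]


def multi_biomarker_score(scaled_levels, weights):
    """Weighted sum of scaled levels over the biomarkers that carry a weight.

    scaled_levels: dict of biomarker name -> SD-scaled level for one person.
    weights: dict of biomarker name -> coefficient (hypothetical here).
    Biomarkers without a weight do not contribute to the score.
    """
    return sum(weight * scaled_levels[name] for name, weight in weights.items())


# Hypothetical weights and scaled levels for one individual.
example_weights = {"GlycA": 0.5, "Albumin": -0.25}
example_levels = {"GlycA": 2.0, "Albumin": 1.0, "Glucose": 3.0}
example_score = multi_biomarker_score(example_levels, example_weights)
```

Because the inputs are in SD units, each weight can be read as the contribution to the score of a one-SD difference in that biomarker.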
Data availability
The data are available for approved researchers from UK Biobank. The metabolic biomarker data
were released to the UK Biobank resource in March 2021.
The following dataset was generated:
Author(s): Cichońska A, Julkunen H, Würtz P
Year: 2021
Dataset title: UK Biobank Nightingale biomarker data
Dataset URL: https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220
Database and Identifier: Biobank, 220
References
Ahola-Olli AV, Mustelin L, Kalimeri M, Kettunen J, Jokelainen J, Auvinen J, Puukka K, Havulinna AS, Lehtimäki T,
Kähönen M, Juonala M, Keinänen-Kiukaanniemi S, Salomaa V, Perola M, Järvelin MR, Ala-Korpela M, Raitakari
O, Würtz P. 2019. Circulating metabolites and the risk of type 2 diabetes: a prospective study of 11,896 young
adults from four Finnish cohorts. Diabetologia 62:2298–2309. DOI: https://doi.org/10.1007/s00125-019-05001-w, PMID: 31584131
Akbar AN, Gilroy DW. 2020. Aging immunity may exacerbate COVID-19. Science 369:256–257. DOI: https://doi.org/10.1126/science.abb0762, PMID: 32675364
Almirall J, Serra-Prat M, Bolíbar I, Balasso V. 2017. Risk factors for Community-Acquired pneumonia in adults: a
systematic review of observational studies. Respiration 94:299–311. DOI: https://doi.org/10.1159/000479089,
PMID: 28738364
Atkins JL, Masoli JAH, Delgado J, Pilling LC, Kuo C-L, Kuchel GA, Melzer D. 2020. Preexisting comorbidities
predicting COVID-19 and mortality in the UK biobank community cohort. The Journals of Gerontology: Series
A 75:2224–2230. DOI: https://doi.org/10.1093/gerona/glaa183
Bonafè M, Prattichizzo F, Giuliani A, Storci G, Sabbatinelli J, Olivieri F. 2020. Inflamm-aging: why older men are
the most susceptible to SARS-CoV-2 complicated outcomes. Cytokine & Growth Factor Reviews 53:33–37.
DOI: https://doi.org/10.1016/j.cytogfr.2020.04.005, PMID: 32389499
Deelen J, Kettunen J, Fischer K, van der Spek A, Trompet S, Kastenmüller G, Boyd A, Zierer J, van den Akker EB,
Ala-Korpela M, Amin N, Demirkan A, Ghanbari M, van Heemst D, Ikram MA, van Klinken JB, Mooijaart SP,
Peters A, Salomaa V, Sattar N, et al. 2019. A metabolic profile of all-cause mortality risk identified in an
observational study of 44,168 individuals. Nature Communications 10:3346. DOI: https://doi.org/10.1038/s41467-019-11311-9, PMID: 31431621
Dierckx T, van Elslande J, Salmela H. 2020. The metabolic fingerprint of COVID-19 severity. medRxiv.
DOI: https://doi.org/10.1101/2020.11.09.20228221
Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, Collins R, Allen NE. 2017. Comparison of
Sociodemographic and Health-Related characteristics of UK biobank participants with those of the general
population. American Journal of Epidemiology 186:1026–1034. DOI: https://doi.org/10.1093/aje/kwx246,
PMID: 28641372
Griffith GJ, Morris TT, Tudball MJ, Herbert A, Mancano G, Pike L, Sharp GC, Sterne J, Palmer TM, Davey Smith
G, Tilling K, Zuccolo L, Davies NM, Hemani G. 2020. Collider Bias undermines our understanding of COVID-19
disease risk and severity. Nature Communications 11:5749. DOI: https://doi.org/10.1038/s41467-020-19478-2,
PMID: 33184277
Ho FK, Celis-Morales CA, Gray SR, Katikireddi SV, Niedzwiedz CL, Hastie C, Ferguson LD, Berry C, Mackay DF,
Gill JM, Pell JP, Sattar N, Welsh P. 2020. Modifiable and non-modifiable risk factors for COVID-19, and
comparison to risk factors for influenza and pneumonia: results from a UK biobank prospective cohort study.
BMJ Open 10:e040402. DOI: https://doi.org/10.1136/bmjopen-2020-040402, PMID: 33444201
Holmes MV, Millwood IY, Kartsonaki C, Hill MR, Bennett DA, Boxall R, Guo Y, Xu X, Bian Z, Hu R, Walters RG,
Chen J, Ala-Korpela M, Parish S, Clarke RJ, Peto R, Collins R, Li L, Chen Z, China Kadoorie Biobank
Collaborative Group. 2018. Lipids, lipoproteins, and metabolites and Risk of Myocardial Infarction and Stroke.
Journal of the American College of Cardiology 71:620–632. DOI: https://doi.org/10.1016/j.jacc.2017.12.006,
PMID: 29420958
Julkunen H, Cichońska A, Nightingale Health UK Biobank Initiative. 2020. Blood biomarker score identifies
individuals at high risk for severe COVID-19 a decade prior to diagnosis: metabolic profiling of 105,000 adults
in the UK biobank. medRxiv. DOI: https://doi.org/10.1101/2020.07.02.20143685
Kermali M, Khalsa RK, Pillai K, Ismail Z, Harky A. 2020. The role of biomarkers in diagnosis of COVID-19 - A
systematic review. Life Sciences 254:117788. DOI: https://doi.org/10.1016/j.lfs.2020.117788, PMID: 32475810
Keyes KM, Westreich D. 2019. UK Biobank, big data, and the consequences of non-representativeness. The
Lancet 393:1297. DOI: https://doi.org/10.1016/S0140-6736(18)33067-8
Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT,
Kathiresan S. 2018. Genome-wide polygenic scores for common diseases identify individuals with risk
equivalent to monogenic mutations. Nature Genetics 50:1219–1224. DOI: https://doi.org/10.1038/s41588-018-0183-z, PMID: 30104762
Mangalmurti N, Hunter CA. 2020. Cytokine storms: understanding COVID-19. Immunity 53:19–25. DOI: https://doi.org/10.1016/j.immuni.2020.06.017, PMID: 32610079
Messner CB, Demichev V, Wendisch D, Michalick L, White M, Freiwald A, Textoris-Taube K, Vernardis SI, Egger
AS, Kreidl M, Ludwig D, Kilian C, Agostini F, Zelezniak A, Thibeault C, Pfeiffer M, Hippenstiel S, Hocke A, von
Kalle C, Campbell A, et al. 2020. Ultra-High-Throughput clinical proteomics reveals classifiers of COVID-19
infection. Cell Systems 11:11–24. DOI: https://doi.org/10.1016/j.cels.2020.05.012, PMID: 32619549
Resource UKBiobankD. 2020. COVID-19 test results data. http://biobank.ctsu.ox.ac.uk/crystal/exinfo.cgi?src=COVID19_tests [Accessed February 3, 2021].
Ritchie SC, Würtz P, Nath AP, Abraham G, Havulinna AS, Fearnley LG, Sarin AP, Kangas AJ, Soininen P, Aalto K,
Seppälä I, Raitoharju E, Salmi M, Maksimow M, Männistö S, Kähönen M, Juonala M, Ripatti S, Lehtimäki T,
Jalkanen S, et al. 2015. The biomarker GlycA is associated with chronic inflammation and predicts Long-Term
risk of severe infection. Cell Systems 1:293–301. DOI: https://doi.org/10.1016/j.cels.2015.09.007,
PMID: 27136058
Shen B, Yi X, Sun Y, Bi X, Du J, Zhang C, Quan S, Zhang F, Sun R, Qian L, Ge W, Liu W, Liang S, Chen H, Zhang
Y, Li J, Xu J, He Z, Chen B, Wang J, et al. 2020. Proteomic and metabolomic characterization of COVID-19
patient sera. Cell 182:59–72. DOI: https://doi.org/10.1016/j.cell.2020.05.032
Soininen P, Kangas AJ, Würtz P, Suna T, Ala-Korpela M. 2015. Quantitative serum nuclear magnetic resonance
metabolomics in cardiovascular epidemiology and genetics. Circulation: Cardiovascular Genetics 8:192–206.
DOI: https://doi.org/10.1161/CIRCGENETICS.114.000216, PMID: 25691689
Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, Liu B,
Matthews P, Ong G, Pell J, Silman A, Young A, Sprosen T, Peakman T, Collins R. 2015. UK biobank: an open
access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS
Medicine 12:e1001779. DOI: https://doi.org/10.1371/journal.pmed.1001779, PMID: 25826379
Williamson EJ, Walker AJ, Bhaskaran K. 2020. OpenSAFELY: factors associated with COVID-19 death in 17
million patients. Nature 584:430–436. DOI: https://doi.org/10.1038/s41586-020-2521-4
Würtz P, Kangas AJ, Soininen P, Lawlor DA, Davey Smith G, Ala-Korpela M. 2017. Quantitative Serum Nuclear
Magnetic Resonance Metabolomics in Large-Scale Epidemiology: A Primer on -Omic Technologies. American
Journal of Epidemiology 186:1084–1096. DOI: https://doi.org/10.1093/aje/kwx016, PMID: 29106475
Zhou F, Yu T, Du R, Fan G, Liu Y, Liu Z, Xiang J, Wang Y, Song B, Gu X, Guan L, Wei Y, Li H, Wu X, Xu J, Tu S,
Zhang Y, Chen H, Cao B. 2020. Clinical course and risk factors for mortality of adult inpatients with COVID-19
in Wuhan, China: a retrospective cohort study. The Lancet 395:1054–1062.
DOI: https://doi.org/10.1016/S0140-6736(20)30566-3, PMID: 32171076
Publication III
Heli Julkunen, Anna Cichońska, Mika Tiainen, Harri Koskela, Kristian
Nybo, Valtteri Mäkelä, Jussi Nokso-Koivisto, Kati Kristiansson, Markus
Perola, Veikko Salomaa, Pekka Jousilahti, Annamari Lundqvist, Antti J.
Kangas, Pasi Soininen, Jeffrey C. Barrett, Peter Würtz. Atlas of plasma
NMR biomarkers for health and disease in 118,461 individuals from the UK
Biobank. Nature Communications, February 2023.
© 2023 The Author(s). This is an open access article distributed under the terms of
the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Article
https://doi.org/10.1038/s41467-023-36231-7
Atlas of plasma NMR biomarkers for health
and disease in 118,461 individuals from the
UK Biobank
Received: 28 June 2022
Accepted: 20 January 2023
Heli Julkunen1, Anna Cichońska1, Mika Tiainen1, Harri Koskela1,
Kristian Nybo1, Valtteri Mäkelä1, Jussi Nokso-Koivisto1, Kati Kristiansson2,
Markus Perola2, Veikko Salomaa2, Pekka Jousilahti2, Annamari Lundqvist2,
Antti J. Kangas1, Pasi Soininen1, Jeffrey C. Barrett1 & Peter Würtz1
Blood lipids and metabolites are markers of current health and future disease
risk. Here, we describe plasma nuclear magnetic resonance (NMR) biomarker
data for 118,461 participants in the UK Biobank. The biomarkers cover 249
measures of lipoprotein lipids, fatty acids, and small molecules such as amino
acids, ketones, and glycolysis metabolites. We provide an atlas of associations
of these biomarkers to prevalence, incidence, and mortality of over 700
common diseases (nightingalehealth.com/atlas). The results reveal a plethora
of biomarker associations, including susceptibility to infectious diseases and
risk of various cancers, joint disorders, and mental health outcomes, indicating
that abundant circulating lipids and metabolites are risk markers beyond
cardiometabolic diseases. Clustering analyses indicate similar biomarker
association patterns across different disease types, suggesting latent systemic
connectivity in the susceptibility to a diverse set of diseases. This work highlights the value of NMR-based metabolic biomarker profiling in large biobanks
for public health research and translation.
UK Biobank is a prospective study of ~500,000 individuals who have
volunteered to have their health information shared with scientists
across the globe to advance public health research. This open resource
is unique in its size and availability of extensive phenotypic and
genomic data1–3. A selection of 30 routine blood biomarkers has previously been measured in the full cohort4,5, but there is a unique
opportunity to evaluate the public health relevance of a wider range of
biomarkers and to accelerate translation, as exemplified by genome-wide genotyping for population-based risk identification6.
Here, we describe detailed metabolic biomarkers quantified by
nuclear magnetic resonance (NMR) spectroscopy of 118,461 baseline
plasma samples, generated by Nightingale Health Plc (Fig. 1a). The
sample size is more than ten-fold larger than many of the largest
metabolic profiling studies conducted to date7,8. The NMR biomarker
panel comprises 249 measures of lipids and metabolites (Fig. 1b).
These data are now available to approved researchers through the UK
Biobank Showcase for all aspects of public health research. Many
studies are already using these biomarker data, spanning applications
related to, for instance, risk prediction, causal analyses, genetic discovery and drug target validation9–18.
In this study, we present a comprehensive atlas of biomarker-disease associations (available at nightingalehealth.com/atlas), systematically examined across the 249 metabolic measures in relation to
presence, future onset and mortality of over 700 disease outcomes
(Fig. 1c). We illustrate the use of the atlas for biomarker discovery and
identification of connections between overall biomarker signatures for
various diseases. We replicate the findings in over 30,000 individuals
from five prospective cohorts in the Finnish Institute for Health and
Welfare (THL) Biobank profiled using the same NMR platform. Our
biomarker-disease atlas may serve as a starting point to move from biomarker discovery to more detailed analyses in biological and clinical context.
1 Nightingale Health Plc, Helsinki, Finland. 2 Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland.
e-mail: Heli.julkunen@nightingalehealth.com; Peter.wurtz@nightingalehealth.com
Nature Communications | (2023)14:604
Fig. 1 | Nuclear magnetic resonance (NMR) biomarker data in the UK Biobank
and atlas of disease associations. a Process of the Nightingale Health-UK Biobank
Initiative: 1) EDTA plasma samples from the baseline survey were prepared on 96-well plates and shipped to Nightingale Health laboratories in Finland, 2) Buffer was
added and samples transferred to NMR tubes, 3) Samples were measured using six
500 MHz proton NMR spectrometers, 4) Automated spectral processing software
was used to quantify 249 biomarker measures from each sample, 5) Quality control
metrics based on blind duplicates and internal control samples were used to track
consistency metrics throughout the project, 6) Biomarker data were cleaned,
provided to UK Biobank and released to the research community. b Overview of
biomarker types included in the Nightingale Health NMR biomarker panel.
c Schematic illustration of the atlas of biomarker-disease associations published
along with this study. The webtool displays the associations of all biomarkers versus prevalence, incidence and mortality of each disease endpoint, and
shows each biomarker versus all disease endpoints.
Results
Plasma biomarker profiling by NMR
We measured lipid and metabolite biomarkers from 118,461 baseline plasma samples using the Nightingale Health NMR platform (Fig. 1a)7,9,19. Table 1 shows the characteristics of the participants with NMR biomarker data currently available in the UK Biobank. The EDTA plasma samples were picked randomly and are therefore representative of the 502,543 participants in the full cohort. Samples were generally drawn non-fasting, with an average of 4 hours since the last meal. The data release also contains biomarker measurements of ~4000 repeat-visit samples collected on average four years after the baseline, with ~1500 participants having biomarker data from both baseline and the repeat-visit survey.
Table 1 | Characteristics of the UK Biobank participants with plasma NMR biomarkers in the first data release from Nightingale Health
Characteristic                                          Subset with NMR biomarkers, baseline    Full cohort, baseline
Number of participants                                  118,461                                 502,543
Age at blood sampling (median, [range])                 58 [39–71]                              58 [37–73]
Females (%)                                             54                                      54
Body mass index (kg/m2), mean                           27.4                                    27.4
Smoking prevalence (regular, occasional; %)             7.9, 2.7                                7.8, 2.7
Fasting time, mean (h)                                  3.8                                     3.8
Self-reported cholesterol-lowering medication use (%)   18                                      17
The Nightingale Health NMR biomarker platform quantifies 249 metabolic measures from each sample in a single experimental assay, comprising 168 measures in absolute levels and 81 ratio measures
(Fig. 1b). The biomarkers include measures already routinely used in
clinical practice, such as cholesterol, as well as many emerging biomarkers increasingly measured in cohorts, such as omega-3 and other
fatty acids7,20. The panel of biomarkers is based on feasibility for
accurate quantification in a high-throughput manner, and therefore
mostly reflects molecules with high circulating concentration. Most of
the biomarkers relate to lipoprotein metabolism, with the lipid concentrations and composition measured in 14 lipoprotein subclasses in
terms of triglycerides, phospholipids, total cholesterol, cholesterol
esters, and free cholesterol, and total lipid concentration within each
subclass. The panel additionally includes the absolute concentration
and relative balance of the most abundant plasma fatty acids, such as
saturated fatty acids, and small molecules, like amino acids, and
ketone bodies. Apolipoproteins B and A1, and two inflammatory protein measures, albumin and glycoprotein acetyls, are also measured,
owing to their high abundance in plasma.
Details of the NMR biomarker measurements of the UK Biobank
samples are described in ‘Methods’. Key steps of the measurement
process are illustrated in Supplementary Fig. 1 and an overview of all
measured biomarkers is provided in Supplementary Fig. 2. The quality
control protocol is described in Supplementary Methods and Supplementary Fig. 3. Coefficients of variation of the biomarkers are
shown in Supplementary Fig. 4 and technical as well as biological
variability are illustrated in Supplementary Fig. 5. Comparisons of the NMR
biomarker measurements to routine clinical chemistry are illustrated in
Supplementary Fig. 6 and to other multi-biomarker assays measured in
smaller cohorts in Supplementary Figs. 7 and 8.
Atlas of biomarker–disease associations
The extensive electronic health records in the UK Biobank and the
unprecedented sample size make it possible to study biomarker
associations across the full spectrum of common diseases. We systematically computed the associations of the 249 NMR biomarkers
with over 700 disease endpoints. Incident and mortality endpoints
were defined by 3-character ICD-10 codes from nationwide hospital
episode statistics and death records for diseases with at least 50 events
occurring during 10 years after blood sampling. Prevalent endpoints
were defined for diseases with over 50 events in the hospital records
during ~25 years before the blood sampling. Details of the data preprocessing and statistical modelling are described in Methods. We
collated the results in the form of an online atlas of biomarker-disease
associations available at nightingalehealth.com/atlas (Fig. 1c). The
webtool can display interactive forest plots for all biomarkers with
prevalence, incidence, and mortality of each disease endpoint, as well
as disease-wide association plots for each of the 249 biomarkers.
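The endpoint definition described above (3-character ICD-10 codes, kept only when enough events are available) can be illustrated with a minimal sketch. This is not the authors' pipeline: the input layout of (participant, code) pairs and the function name `define_endpoints` are assumptions made for illustration, and the sketch applies the "at least 50 events" rule stated for incident and mortality endpoints.

```python
from collections import defaultdict

MIN_EVENTS = 50  # threshold stated in the text for incident and mortality endpoints


def define_endpoints(diagnoses, min_events=MIN_EVENTS):
    """Group diagnosis records by 3-character ICD-10 code and keep only
    codes observed in at least `min_events` distinct participants.

    `diagnoses` is an iterable of (participant_id, icd10_code) pairs;
    this input layout is hypothetical.
    """
    cases = defaultdict(set)
    for participant_id, code in diagnoses:
        # Normalise e.g. "I21.4" or "i214" to the 3-character code "I21".
        key = code.replace(".", "").strip().upper()[:3]
        if len(key) == 3:
            cases[key].add(participant_id)
    return {code: ids for code, ids in cases.items() if len(ids) >= min_events}


# Illustrative records only: 60 participants with I21.x, 10 with A41.9.
records = [(i, "I21.4") for i in range(60)]
records += [(i, "A41.9") for i in range(10)]
records += [(0, "I21.0")]  # a second I21 record for the same participant
endpoints = define_endpoints(records)
```

Counting distinct participants rather than raw records means that repeated hospital episodes for the same person contribute a single event, which is one plausible reading of "events" here.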
We observed a total of 33,764 individual biomarker associations
to incident disease endpoints at p < 5e-5 (Methods). Similarly, for 648
prevalent disease endpoints and 77 causes of death, 26,035 and
3,055 significant associations were identified, respectively. These biomarker associations were not concentrated in cardiometabolic diseases but spread across nearly all ICD-10 chapters. Examples include
infectious diseases of both systemic and local character, certain cancers as well as mental and neurological disorders and musculoskeletal
diseases. The magnitudes of biomarker associations for these diverse
types of diseases were often similar to those of cardiovascular diseases. In the subsequent analyses in this paper, we focus on analyses of
the future onset of diseases from ICD-10 chapters A-N and the 37
biomarkers from the Nightingale Health NMR platform certified for
diagnostic use.
Biomarkers across the spectrum of diseases
Examining the NMR biomarkers across the spectrum of common diseases can provide insights into disease pathophysiology and specificity
of the biomarkers. Fig. 2a illustrates the span of diseases in different
ICD-10 chapters associated with the 37 clinically certified biomarkers.
Many of the biomarkers exhibited associations across all types of
diseases, with the exception of diseases of the eyes and the ears. For
example, monounsaturated fatty acids relative to total fatty acids
(MUFA%) were associated with almost 200 different disease endpoints
spanning all ICD-10 chapters A-N. Also, more established biomarkers
such as omega-3% (i.e. concentration relative to total fatty acids) and
routine cholesterol measures were associated with a wide spectrum of
diseases. Glycolysis-related metabolites and amino acids displayed
fewer associations, but still spanned more than endocrine and circulatory diseases.
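Ratio measures such as omega-3% and MUFA% express a fatty acid concentration relative to total fatty acids. A minimal sketch of that calculation follows; the concentrations and the helper name `percent_of_total` are hypothetical and only illustrate the arithmetic.

```python
def percent_of_total(component, total):
    """Return a fatty acid measure as a percentage of total fatty acids."""
    if total <= 0:
        raise ValueError("total fatty acid concentration must be positive")
    return 100.0 * component / total


# Hypothetical concentrations for one sample (same units for all entries).
sample = {"omega_3": 0.5, "mufa": 3.0, "total_fatty_acids": 12.0}
omega_3_pct = percent_of_total(sample["omega_3"], sample["total_fatty_acids"])
mufa_pct = percent_of_total(sample["mufa"], sample["total_fatty_acids"])
```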
Figure 2b–e shows the strongest incident disease associations in
detail for four exemplar biomarkers; further examples are shown in
Supplementary Fig. 9. The inflammatory biomarker glycoprotein
acetyls, also known as GlycA, was associated with the risk of 32% of the
incident disease endpoints examined (p < 5e-5), with a median hazard
ratio of 1.26 per 1-SD increment in the biomarker concentration. The
most significant associations were observed for gout, type 2 diabetes,
smoking dependence, kidney diseases, chronic obstructive pulmonary
disorder, myocardial infarction, pneumonia and anemias. Figure 2c
highlights the strongest disease associations for the ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA),
showing as widespread disease associations as for GlycA. Similar
results were observed also for other fatty acid measures, such as
omega-3% and omega-6% as well as MUFA% (Supplementary Fig. 9a–c).
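The hazard ratios quoted here are per 1-SD increment in biomarker concentration, reported with 95% confidence intervals and compared against p < 5e-5. As a hedged sketch of how such a summary follows from a fitted Cox coefficient, the snippet below converts a log-hazard coefficient and its standard error into a hazard ratio, interval and two-sided Wald p-value. The coefficient and standard error are invented for illustration, and the use of a Wald test is an assumption rather than a statement of the authors' method.

```python
import math


def hazard_ratio_summary(beta, se, z_crit=1.959964):
    """Convert a Cox log-hazard coefficient (per 1-SD of the biomarker) and
    its standard error into a hazard ratio, a 95% confidence interval and
    a two-sided Wald p-value."""
    hr = math.exp(beta)
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return hr, lower, upper, p_value


# Hypothetical coefficient chosen so that the hazard ratio equals 1.26.
hr, lower, upper, p = hazard_ratio_summary(math.log(1.26), 0.03)
significant = p < 5e-5  # significance threshold used in the text
```

Because the biomarker is scaled to SD units before fitting, exp(beta) can be compared directly across biomarkers measured in different units.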
By contrast to this pattern of diverse associations, some biomarkers exhibited more distinct disease specificity. For instance, the
amino acid alanine was primarily associated with the risk of diabetes
and complications related to diabetes (Fig. 2d). Glycine and glutamine
(Supplementary Fig. 9d, e) were also associated with diabetes-related
complications, but additionally with the risk of liver and kidney diseases, with lower plasma concentrations indicating higher disease risk.
Glycine was also strongly associated with many circulatory disease
endpoints, in line with the earlier suggested causal role of glycine
levels in coronary heart disease21. Most of the biomarkers had a consistent direction of associations across different diseases, but not all.
For example, higher branched-chain amino acid levels were associated
with a higher risk for many metabolic diseases but a lower risk for a
range of other diseases such as lung diseases, hernia and smoking
dependence (Fig. 2e). A small number of biomarkers showed only weak
magnitude of association across the spectrum of diseases, such as the
ketone body 3-hydroxybutyrate (Supplementary Fig. 9f).
Fig. 2 | Biomarkers for future disease onset across a spectrum of diseases. a Total number of incident disease associations by biomarker at statistical significance level p < 5e-5. The disease outcomes were defined based on 3-character ICD-10 codes with 50 or more events from chapters A-N, with a total of 556 diseases tested for association. The colour coding indicates the proportion of associations coming from each ICD-10 chapter from A to N. b–e Twenty most significant associations for four biomarkers: b Glycoprotein acetyls, c Ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA), d Alanine, and e Branched-chain amino acids (BCAA). The forest plots highlight 20 of the most significant associations, arranged according to decreasing association magnitude. Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. All models were adjusted for age, sex and UK Biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Similar disease-wide association plots for all 249 biomarkers across all endpoints analysed are available in the biomarker-disease atlas webtool. Source data are provided as a Source Data file.
Considered from the disease perspective, Fig. 3 shows the biomarker association profiles for the incidence of six exemplar diseases. Multiple biomarkers are associated with incident hospitalisation for sleep disorders, depression, lung cancer and sepsis, with magnitudes of associations generally similar to those of myocardial infarction. The majority of the biomarkers were associated exclusively in one direction of effect across these diseases and exhibited similar association patterns overall. An exception to this is osteoporosis, for which increased risk was characterised by decreased concentrations of branched-chain amino acids and triglycerides, and higher high-density lipoprotein cholesterol and apolipoprotein A1, in contrast to the other diseases in Fig. 3. All biomarker associations were robust to a sensitivity analysis excluding the first two years of follow-up, suggesting that they are not driven by clinically incipient cases at baseline (Supplementary Fig. 10).
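A sensitivity analysis that excludes the first two years of follow-up can be set up in several ways. The sketch below shows one common landmark-style approach, under stated assumptions: follow-up is measured in years since blood sampling, and the record layout and function name are hypothetical. With age as the timescale, as in the models here, the equivalent step would be to shift each participant's entry age by the lag instead of subtracting it from follow-up time.

```python
def exclude_early_follow_up(records, lag_years=2.0):
    """Landmark-style filter: drop participants whose follow-up ended
    (event or censoring) within `lag_years` of blood sampling, and start
    the clock at the landmark for everyone else.

    `records` is a list of (follow_up_years, had_event) tuples;
    this layout is hypothetical.
    """
    kept = []
    for follow_up_years, had_event in records:
        if follow_up_years > lag_years:
            kept.append((follow_up_years - lag_years, had_event))
    return kept


# Illustrative follow-up records only.
cohort = [(1.0, True), (2.0, False), (5.0, True), (10.0, False)]
late_cohort = exclude_early_follow_up(cohort)
```

The filtered records would then be passed to the same regression as the main analysis, so that any change in the estimates reflects only the removal of early events.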
[Figure 3: forest plots of biomarker groups (inflammation, amino acids, aromatic and branched-chain amino acids, apolipoproteins, fatty acids and fatty acid ratios, fluid balance, glycolysis related metabolites, cholesterol, triglycerides) for six diseases; x-axis: hazard ratio (95% CI), per 1-SD increment.]
Fig. 3 | Biomarker profiles for the incidence of various types of diseases. Hazard ratios of biomarkers with the incidence of six disease examples: A41 Sepsis (red; n = 117,806, 2986 events), C34 Lung cancer (light blue; n = 117,964, 1210 events), F32 Depression (green; n = 116,993, 5455 events), G47 Sleep disorders (dark blue; n = 117,325, 1865 events), I21 Myocardial infarction (orange; n = 116,797, 2523 events) and M81 Osteoporosis (lavender; n = 117,538, 3326 events). Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. The models were adjusted for age, sex and UK Biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Filled points indicate statistically significant associations (p < 5e-5), and hollow points are non-significant ones. Similar forest plots for all 249 NMR biomarkers across all endpoints analysed are provided in the biomarker-disease atlas webtool. BCAA indicates branched-chain amino acids, DHA docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty acids. Source data are provided as a Source Data file.
Shared biomarker signatures for different diseases
Comparing biomarker signatures between diseases may help to understand molecular differences between conditions with similar pathophysiology and identify novel connections8,22. Figure 4a shows examples of clustering of diseases according to their overall biomarker association patterns. In the vertical direction, biomarkers such as GlycA and MUFA% cluster together due to their similarity in associations with many different types of diseases. Most amino acids cluster together, but glycine and histidine have deviating associations more similar to those of omega-6% and omega-3%, respectively. In the horizontal direction, the clustering analysis reveals both well-known connections between diseases and less anticipated similarities. For example, diabetes has highly similar biomarker association patterns with several of its complications, including polyneuropathies and retinal disorders. Common diseases of an infectious origin, pneumonia and general bacterial infection, also cluster together in terms of their overall biomarker association patterns, as do COPD and lung cancer. Some of the less well-known connections include, for instance, liver diseases and polyneuropathies, which had almost identical overall biomarker associations, as further highlighted in Fig. 4b.
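The Fig. 4 caption states that the dendrograms use complete linkage clustering based on the linear correlation between association signatures. A minimal standard-library sketch of that idea follows. It assumes a distance of 1 minus the Pearson correlation, which is a common choice but not confirmed by the text, and the disease names and log hazard ratios are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def complete_linkage(signatures):
    """Agglomerative clustering with distance 1 - Pearson correlation.

    `signatures` maps a disease name to its vector of log hazard ratios
    (one entry per biomarker). Returns the merge history as a list of
    (cluster_a, cluster_b, distance) tuples in merge order.
    """
    names = list(signatures)
    dist = {
        frozenset((a, b)): 1.0 - pearson(signatures[a], signatures[b])
        for i, a in enumerate(names) for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the farthest members.
                d = max(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical log hazard ratios for four biomarkers (illustrative only).
example = {
    "disease_a": [0.30, 0.10, -0.20, 0.25],
    "disease_b": [0.28, 0.12, -0.18, 0.22],
    "disease_c": [-0.15, -0.05, 0.10, -0.20],
}
merges = complete_linkage(example)
```

In this toy input the first two diseases have nearly proportional signatures and merge first, mirroring how diseases with similar biomarker association patterns end up adjacent in the heatmap. A signature with no variation across biomarkers would make the correlation undefined and is not handled here.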
The biomarker signatures were similar for many diseases, but
notable differences may still be observed for diseases of similar
pathophysiological origin20. Figure 4c illustrates how acute myocardial infarction and hospitalisation for heart failure have many
deviating biomarker associations even though these two endpoints
are often combined for clinical trial analyses in the five-point major
adverse cardiovascular event (MACE) definition. Supplementary
Figs. 11–13 further illustrate similarities and differences in the biomarker signatures for various other types of cardiovascular diseases.
The biomarker association pattern differed for different types of
myocardial infarction, angina, chronic ischaemic heart disease, and
different types of stroke. Even more pronounced differences were
observed when compared to heart failure and peripheral artery disease. In particular, many biomarker associations appeared to be
stronger for other circulatory endpoints than for myocardial infarction and ischaemic stroke. These results may suggest potential
benefits for risk prediction separately for these types of cardiovascular events.
Replication of biomarker signatures
Replication is essential in biomarker studies, no matter the sample size
of the discovery analyses. We, therefore, sought to replicate the NMR
biomarker associations in the UK Biobank in two ways: first by
comparing the results to biomarkers measured by independent
laboratory assays from the same UK Biobank samples, and second by
analysing NMR biomarker data for over 30,000 participants from the
Finnish Institute for Health and Welfare Biobank (THL biobank).
Figure 5 shows the high concordance between disease associations for the eight biomarkers that have been measured by both NMR and clinical chemistry. The associations always have the same direction, and the hazard ratios are sometimes stronger for one assay and sometimes another, suggesting neither is systematically better at capturing disease association. Small deviations in the results may be because the plasma samples used for the NMR measurements were more affected by a known sample dilution issue than the corresponding serum samples used for clinical chemistry4. The consistency between the NMR-based and clinical chemistry assays in absolute concentrations is illustrated in Supplementary Fig. 7 and further discussed in Methods.
Fig. 4 | Clustering of incident diseases according to their biomarker signatures. a Heatmap showing the clustering of biomarker association signatures for the incidence of a diverse set of diseases. The diseases represent three diseases from each ICD-10 chapter from A to N, selected based on the highest number of significant associations. The colouring indicates the association magnitudes in units of the effect sizes, i.e. log(hazard ratio per SD). The dendrograms depict the similarity of the association patterns, computed using complete linkage clustering based on the linear correlation between the association signatures. Significant associations with p value < 5e-5 are marked with an asterisk. All models were adjusted for age, sex and UK Biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Examples of overall biomarker signatures compared for incidence of b Other diseases of liver (K76) and Other polyneuropathies (G62), and c Acute myocardial infarction (I21) and Heart failure (I50). The hazard ratios for each biomarker are shown as points with 95% confidence intervals (CI) indicated in vertical and horizontal error bars. The colouring of the points indicates the significance of the biomarker association for the pair of diseases. The red lines denote a hazard ratio of 1, and the grey line denotes the diagonal. BCAA indicates branched-chain amino acids, DHA docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty acids. Source data are provided as a Source Data file.
[Figure 5: forest plots comparing NMR and clinical chemistry measurements of Total-C, Clinical LDL-C, HDL-C, Total triglycerides, ApoB, ApoA1, Albumin and Glucose for six diseases; x-axis: hazard ratio (95% CI).]
Fig. 5 | Comparison of nuclear magnetic resonance (NMR) and clinical chemistry biomarker associations. Hazard ratios of biomarkers for which both NMR-based (red) and clinical chemistry (blue) measurements are available, against the incidence of six disease examples: A41 Sepsis (n = 117,806, 2986 events), C34 Lung cancer (n = 117,964, 1210 events), F32 Depression (n = 116,993, 5455 events), G47 Sleep disorders (n = 117,325, 1865 events), I21 Myocardial infarction (n = 116,797, 2523 events) and M81 Osteoporosis (n = 117,538, 3326 events). Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. The models were adjusted for age, sex and UK Biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Filled points indicate statistically significant (p < 5e-5) associations, hollow points non-significant ones. Source data are provided as a Source Data file.
We note that low-density lipoprotein (LDL) cholesterol and apolipoprotein B displayed inverse associations across a wide range of
diseases, i.e. higher concentration was associated with lower risk for
disease incidence (Fig. 5). This observation, which is surprising compared to the existing literature on LDL as a risk factor for heart disease,
is seen in both the NMR and clinical chemistry measurements, indicating that it stems from characteristics of the UK Biobank study rather
than any property of the NMR measurements. This observation was
mainly explained by widespread use of lipid-lowering medications in
the case of cardiovascular endpoints, since the inverse lipid associations were attenuated or reversed in direction when individuals
on lipid-lowering medication were excluded (Supplementary Fig. 14).
Nonetheless, for most non-circulatory diseases, including five of the
six disease examples shown in Fig. 3, the LDL cholesterol associations
remained inverse even after excluding individuals on cholesterol-lowering medication (Supplementary Fig. 15), warranting further
investigation in other cohorts.
We further replicated the associations observed in UK Biobank by
a meta-analysis of five independent population-based cohorts from
Finland measured using the same NMR platform (Methods; clinical
characteristics listed in Supplementary Table 1). Figure 6 illustrates the
consistency of the biomarker association signatures against all-cause
mortality and five available incident disease outcomes. Replication
results for the remaining available endpoints are shown in Supplementary Fig. 16.
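Pooling cohort-specific estimates in a meta-analysis can be sketched as follows. The passage above does not state which meta-analysis model was used, so the inverse-variance fixed-effect pooling shown here is an assumption, and the function name and input values are hypothetical.

```python
import math


def fixed_effect_meta(estimates):
    """Inverse-variance weighted pooling of per-cohort log hazard ratios.

    `estimates` is a list of (log_hr, standard_error) pairs, one per cohort.
    Returns the pooled hazard ratio and its 95% confidence interval.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = 1.959964
    return (math.exp(pooled),
            math.exp(pooled - z * pooled_se),
            math.exp(pooled + z * pooled_se))


# Five hypothetical cohorts with identical estimates (illustrative only).
cohorts = [(0.2, 0.1)] * 5
pooled_hr, pooled_lower, pooled_upper = fixed_effect_meta(cohorts)
```

Weighting by the inverse of the squared standard error gives larger cohorts more influence on the pooled estimate, and the pooled interval narrows as cohorts are added.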
The biomarker associations were generally consistent in the two
biobanks, especially for amino acids and other polar metabolites,
fatty acid ratios and the two inflammatory protein measures. The
results for absolute fatty acid concentrations deviated between the
two study populations, whereas the results for fatty acid measures
scaled relative to total fatty acids were highly concordant. This may
suggest that such ratio measures are more easily transferrable across
sampling approaches. The biomarker associations were consistent in
[Figure 6: panels a–f showing hazard ratios (95% CI) for each biomarker in THL biobank cohorts (meta-analysis), UK Biobank, and UK Biobank excluding individuals using cholesterol-lowering medication; panel titles: All-cause mortality, Major adverse cardiovascular event, Diabetes, COPD, Chronic kidney failure, Liver diseases.]
Fig. 6 | Replication of biomarker associations with incident disease. Biomarker
associations for six disease endpoints are shown for THL Biobank (red) and UK
Biobank for the full study population (light blue) as well as for individuals without
self-reported use of cholesterol-lowering medication (dark blue): a All-cause mortality, b Major adverse cardiovascular event, c Diabetes, d Chronic obstructive
pulmonary disease (COPD), e Chronic kidney failure and f Liver diseases. Results
from THL Biobank were meta-analysed across five prospective Finnish cohorts (FINRISK 1997, 2002, 2007, and 2012, and Health 2000). Data are presented as hazard
ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations.
All models were adjusted for age and sex, using age as the timescale of the Cox
proportional hazards regression. Analyses in the UK Biobank were additionally
adjusted for the UK Biobank assessment centre. Filled points indicate statistically
significant associations (p < 5e-5), and hollow points non-significant ones. The black
horizontal line denotes a hazard ratio of 1. Event numbers for incident disease or
mortality in the two biobanks are shown in Table 2. ICD-10 codes used for compiling the composite endpoints are listed in Supplementary Table 2. The replication
results are shown here for six endpoints available in THL Biobank; results for all
overlapping endpoints are shown in Supplementary Fig. 16. Results are shown
separately for each of the five Finnish cohorts in Supplementary Fig. 17. BCAA
indicates branched-chain amino acids, DHA docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty
acids. Source data are provided as a Source Data file.
each of the five Finnish cohorts, although there was a tendency for
stronger hazard ratios for the cohort with shortest follow-up time
(Supplementary Fig. 17). The greatest deviations were observed for
aforementioned LDL-related biomarkers, which displayed strong
inverse associations for diabetes and major adverse cardiovascular
event (MACE) in UK Biobank but flat or weakly positive associations
in the Finnish cohorts. After excluding participants using cholesterol-lowering medication in the UK Biobank, the associations generally
became more consistent (Fig. 6). However, many of the inverse
associations for LDL cholesterol and related lipids also replicated in
the Finnish cohorts, such as in the case of all-cause mortality and
chronic kidney failure (Fig. 6a, e).
Table 2 | Sample size and number of events for replication analyses

Endpoint              | THL Biobank             | UK Biobank              | UK Biobank subset
                      | Number of events/N (%)  | Number of events/N (%)  | Number of events/N (%)
----------------------|-------------------------|-------------------------|------------------------
All-cause mortality   | 3 928/34 019 (11.55%)   | 7 802/117 868 (6.62%)   | 5 219/97 212 (5.37%)
Chronic kidney failure| 328/33 982 (0.97%)      | 4 254/117 550 (3.62%)   | 2 270/97 074 (2.34%)
COPD                  | 732/33 736 (2.17%)      | 4 404/117 141 (3.76%)   | 2 885/96 811 (2.98%)
Liver diseases        | 417/33 783 (1.23%)      | 2 696/117 328 (2.3%)    | 1 884/96 828 (1.95%)
MACE                  | 4 640/31 754 (14.61%)   | 6 511/115 745 (5.63%)   | 4 311/96 885 (4.45%)
Diabetes              | 2 703/31 565 (8.56%)    | 6 836/115 579 (5.91%)   | 3 376/96 746 (3.49%)

UK Biobank subset represents the subset excluding individuals with self-reported use of cholesterol-lowering medication.
COPD chronic obstructive pulmonary disease, MACE major adverse cardiovascular event.
Age and lipid-lowering medication effects
Excluding individuals using lipid-lowering medication might introduce
collider bias in the findings by selecting for healthier individuals. To
provide more context for evaluating these results, we also replicated
the results in FINRISK 1997 cohort which has a low prevalence of
cholesterol-lowering medication use due to the cohort being sampled
in 1997 (3.5% in the full cohort, 4.5% after matching age to UK biobank).
The results are shown in Supplementary Fig. 18, with analyses matched
to the age range of UK Biobank participants. Most of the biomarker
associations were consistent in this comparison and the aforementioned inverse and weak associations for LDL-related lipids observed in
the UK biobank were also seen in the FINRISK 1997 cohort that is much
less affected by cholesterol-lowering medication. This includes, for
instance, the null association of LDL cholesterol with MACE and the
inverse associations with all-cause mortality and chronic kidney failure.
These results suggest that the observations made in UK Biobank after
excluding cholesterol-lowering medication users are likely not primarily due to collider bias, but rather relate to the characteristics of the
higher-aged individuals in UK Biobank.
To provide another angle on the influence of cholesterol-lowering
and other medications on the biomarker associations, we stratified the
biomarker analyses by age tertiles4. As the use of cholesterol-lowering
and other medications increases with age, younger age groups are less
prone to such sources of bias. Fig. 7 shows age-stratified biomarker
associations for 17 biomarkers across the incidence of the six exemplary diseases from Fig. 3. Results for the remaining 20 biomarkers are
shown in Supplementary Fig. 19. In many cases, the association magnitudes were stronger in the youngest age tertile. In particular, notable
differences were observed in the case of LDL-related biomarkers, for
which the associations became weaker in the older tertiles against
myocardial infarction and completely inverted direction against non-circulatory diseases, which can likely be at least partially attributed to
the higher prevalence of statin use in the oldest age groups. Increased
association magnitudes with younger age were also observed for
biomarkers known to not be affected by lipid-lowering treatment23,24,
including inflammatory protein biomarkers and several amino acids,
suggesting that the effects cannot be entirely attributed to a lower
prevalence of statin use among the younger individuals. Comparisons
of the age-stratified association estimates across all endpoints analysed
are available in the biomarker-disease atlas webtool.
Discussion
Detailed biomarker profiling is a key part of the promise of precision
medicine initiatives to transform preventative healthcare. Blood biomarkers provide modifiable molecular measures which relate to future
health outcomes and serve as intermediates between lifestyle factors
and disease risk. This study describes the generation of NMR biomarker data by Nightingale Health in the UK Biobank, which is currently the world’s largest resource of metabolic biomarkers linked to
health records. These data greatly extend the blood biomarker coverage in the UK Biobank and provide a wide span of molecular
biomarkers not commonly measured in clinical practice, including
amino acids, ketones and fatty acids. With over 118,000 plasma samples profiled in the UK Biobank, the addressable research questions
extend vastly beyond biomarker discovery and the large sample size
benefits, for example, causal analyses and risk prediction9,13,14,17. Due to
the streamlined data access policy in UK Biobank, the data release
opens possibilities for the research community to use the entire epidemiological toolbox to study the NMR biomarkers in relation to
public health.
The biomarkers in the Nightingale Health NMR platform are
typically denoted ‘metabolic biomarkers’, and most prior studies on
the data have focused on cardiometabolic diseases. Our analyses
reveal that many of these biomarkers capture risk for many other
diseases as well. This includes the future onset of diseases of the joints,
bones and lungs, many different cancers, as well as many mental disorders
and severe infectious diseases. These results explain earlier
reports on strong associations of the NMR biomarkers with all-cause
mortality25, since many of the biomarkers are associated broadly with
leading causes of morbidity and mortality. Widespread associations
across different diseases are known for inflammatory biomarkers such
as GlycA26,27, but they have not previously been shown for circulating fatty
acids, amino acids or many detailed lipoprotein measures. For example, MUFA% was the biomarker associated across the highest number
of endpoints and showed similar disease clustering as GlycA. Our
results of widespread disease associations for many fatty acid ratios
may suggest that these biomarkers should be considered as markers of
systemic inflammation more so than of recent diet.
Plasma metabolites are increasingly understood to link to
multimorbidities8,27. This is strongly reinforced by our discovery of
biomarker associations with the full spectrum of common diseases. We
observed that a broad range of diseases with different pathophysiology were characterised by similar biomarker association profiles. For
example, severe infectious diseases had similar biomarker signatures
to, for instance, chronic respiratory diseases as well as urinary and
renal diseases. A potential explanation may be that many of the biomarkers reflect the innate immune system’s ability to respond. This
would help to explain why many of the biomarkers were associated
with susceptibility to severe infectious diseases, such as hospitalisation
and death from sepsis, fungal infections and pneumonia9. These
observations illustrate how novel insights beyond individual diseases
can be gained by studying overall biomarker signatures and numerous
disease outcomes simultaneously. The genomic data in UK Biobank
may help to elucidate causality of these results via Mendelian
randomisation11,17.
The striking similarity of the biomarker risk profiles across various
diseases might pose challenges to certain clinical applications requiring high disease specificity. However, it is ideal when aiming to use the
biomarker panel to assess the risk of multiple diseases and overall
health status simultaneously based on a single measurement. This
could potentially be used for individualised health assessment at scale
to prioritise high-risk individuals for further examinations and guide
[Fig. 7: forest plots of hazard ratios (95% CI), per 1-SD increment, for 17 biomarkers grouped as cholesterol, apolipoproteins, triglycerides, fatty acid ratios and inflammation. Panels: A41 Sepsis, C34 Lung cancer, F32 Depression, G47 Sleep disorders, I21 Myocardial infarction, M81 Osteoporosis. Strata: 1st tertile (39-53, statin use 6%), 2nd tertile (54-61, statin use 17%), 3rd tertile (62-71, statin use 30%).]
Fig. 7 | Age-stratified biomarker profiles for the onset of various types of diseases. Biomarker profiles stratified by age tertiles: 1st tertile (39–53 years of age; dark
blue), 2nd tertile (54–61 years of age; red) and 3rd tertile (62–71 years of age; green).
Results are shown for 17 biomarkers across six disease examples: A41 Sepsis
(n = 117,806, 2986 events), C34 Lung cancer (n = 117,964, 1210 events), F32
Depression (n = 116,993, 5455 events), G47 Sleep disorders (n = 117,325, 1865
events), I21 Myocardial infarction (n = 116,797, 2523 events) and M81 Osteoporosis
(n = 117,538, 3326 events). Results for the remaining 20 biomarkers are shown in
Supplementary Fig. 19. Data are presented as hazard ratios and 95% confidence
intervals (CI), per SD-scaled biomarker concentrations. The models were adjusted
for age, sex and UK biobank assessment centre, using age as the timescale of the
Cox proportional hazards regression. Filled points indicate statistically significant
associations (p < 5e-5), and hollow points non-significant ones. Similar forest plots
for all 249 NMR biomarkers across all endpoints analysed are provided in the
biomarker-disease atlas webtool. DHA indicates docosahexaenoic acid, MUFA
monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty
acids. Source data are provided as a Source Data file.
preventative actions. In fact, a recently published study28 demonstrated the potential of the NMR biomarker profiles to predict
multi-disease outcomes, showing predictive improvements over
comprehensive clinical risk factors which were largely shown to
translate into clinical utility. As such, this could have many applications
in clinical settings and provide an attractive tool for multi-disease
risk screening.
Our biomarker-disease atlas published with this paper can be used
to rapidly corroborate or refute many prior biomarker studies. For
instance, we replicate the recent reports on higher branched-chain
amino acid concentrations associated with lower risk for Alzheimer’s
disease and dementia29. The event numbers for these neurodegenerative diseases in UK Biobank alone are similar to those in the eight meta-analysed cohorts. The biomarker-disease atlas may also be used
to call into question other reported biomarker discoveries, such as
branched-chain amino acids in relation to risk for pancreatic
cancer30: the association was essentially flat in UK Biobank despite a
similar number of events. These examples illustrate how the
biomarker-disease atlas may speed up research and serve as a starting
point for analyses that yield deeper aetiological insights and clinical
context, much as widely available GWAS summary statistics transformed the interpretation of genetic studies. We note that the availability of the NMR biomarker data in UK Biobank does not diminish the
relevance of having these data in smaller cohorts, both for replication
and for complementary study designs. For example, the precise estimates of biomarker associations in UK Biobank can make analyses of
smaller cohorts and trials more interpretable in relation to longitudinal
sampling and intervention effects.
Metabolic profiling of all 500,000 baseline plasma samples in UK
Biobank is underway. This will greatly expand the possibilities for
studying rarer diseases and prediction of short-term risk, as well as
open possibilities for analyses focusing on individuals with prevalent
disease and multi-morbidity trajectories. Coupled with the rich genomic data, clinical chemistry and proteomics measures, imaging, complete health-records, and other health-related data that are continually
added to the UK Biobank resource, the NMR biomarker data will
enhance the possibilities for scientific discovery and are set to yield
important findings for public health and clinical use. The data are
available to approved researchers through similar access protocols as
existing UK Biobank data (http://ukbiobank.ac.uk/).
Methods
UK Biobank cohort
The UK Biobank study was approved by the North West Multi-Centre
Research Ethics Committee and all participants provided written
informed consent. The study protocol is available online (https://www.ukbiobank.ac.uk).
The biomarker profiling of plasma samples by NMR
spectroscopy was approved under UK Biobank Project 30418.
The UK Biobank resource is a globally accessible biomedical
database of half a million UK participants aged 40–69 years at
baseline1. Baseline characteristics of the full cohort and the subset with
available NMR biomarker data are provided in Table 1. A large variety
of health information has been collected for each participant. For
instance, the database includes questionnaire data on participants’
socio-economic and lifestyle factors, cognitive tests, imaging data,
heart and lung function measures, and body size and composition measures. Extensive genomic data are available, with genotyping array and
exome-sequencing data available for all participants, and whole-genome sequencing under way2.
The UK Biobank blood sample collection was undertaken at
baseline in 22 local assessment centres across the UK between 2007
and 2010. The blood sample handling and storage protocol has been
previously described31. Prior to the measurement of the NMR biomarkers, 35 biomarkers had been measured from blood and urine
samples by clinical chemistry4,5.
Plasma biomarker profiling by NMR
Nightingale Health Plc. is performing biomarker profiling of baseline
plasma samples for all 500,000 participants in the UK Biobank. Details
of the Nightingale Health NMR biomarker platform have been described previously7,19. The main steps in the experimental procedures are
illustrated in Supplementary Fig. 1. The biomarker measurements took
place in Finland between 2019 and 2020 using six NMR spectrometers.
The first data release covers biomarker measurements from a random
selection of 118,461 EDTA plasma samples from the baseline recruitment. In addition, around 4000 EDTA plasma samples from repeat
assessments are included in the same data release, with both baseline
and repeat-visit sample measured for ~1500 participants. The NMR
biomarker dataset was made available to the research community through the UK Biobank in March 2021.
All sample analysis processes were performed according to the
standard operating procedures that are part of Nightingale Health’s EN
ISO 13485 certified Quality Management System (certified by DEKRA
Certification B.V.). Nightingale Health measured all plasma samples with
a CE-marked In Vitro Diagnostic Medical Device. At the time of completion
of UK Biobank phase 1 samples, 37 of the biomarkers in the panel were
CE-marked and certified for diagnostics use. In order to facilitate
translational applications and visualisation of the results, we focused
on this set of 37 clinically validated biomarkers in the examples highlighted in the paper, as they span most of the different metabolic
pathways measured by the NMR platform. Complete results for all 249
biomarkers measured are provided in the biomarker atlas webtool.
Plasma sample preparation. EDTA plasma samples from aliquot 3
were prepared in 96-well plates by UK Biobank laboratory (Stockport,
UK). At least 90 μL of plasma was aliquoted in each well using TECAN
freedom EVO 150 robotic liquid handlers, which have coefficients of
variation (CV) in pipetting volume at <0.75% across 8 tips. The plasma
samples were shipped to Nightingale Health laboratories in Finland in
96-well plates on dry ice in batches of 5000–20,000 samples. No
selection criteria were applied to the sampling and the 118,461 samples
are therefore a random subset of the full cohort.
Samples were stored in a freezer at −80 °C at Nightingale Health
laboratories after arrival from UK Biobank laboratory. Before preparation, frozen samples were slowly thawed at +4 °C overnight, and
then mixed gently and centrifuged (3 min, 3400 × g, +4 °C) to remove
possible precipitate. Aliquots of each sample were transferred into
3-mm outer-diameter NMR tubes and mixed in 1:1 ratio with a phosphate buffer (75 mM Na2HPO4 in 80%/20% H2O/D2O, pH 7.4, including
also 0.08% sodium 3-(trimethylsilyl) propionate-2,2,3,3-d4 and 0.04%
sodium azide) using an automated liquid handler (PerkinElmer Janus Automated Workstation).
NMR spectroscopy. The plasma samples were measured using six
500 MHz NMR spectrometers (Bruker AVANCE IIIHD). Measurements
were conducted blinded prior to the linkage to the UK Biobank health
outcomes. The prepared plasma samples on 96-well plates were loaded onto a cooled sample changer, which maintains the temperature
of samples waiting to be measured at +6 °C. Two NMR spectra were
recorded for each plasma sample. The first spectrum is a presaturated
proton spectrum, which features resonances arising mainly from
proteins and lipids within various lipoprotein particles. The second
spectrum is a Carr-Purcell-Meiboom-Gill T2-relaxation-filtered spectrum where most of the broad macromolecule and lipoprotein lipid
signals are suppressed, leading to enhanced detection of low-molecular-weight metabolites.
Quantified biomarkers. The biomarkers were quantified using Nightingale Health’s proprietary software (quantification library 2020),
which simultaneously quantifies 249 metabolic measures per EDTA
plasma sample, comprising 168 absolute and 81 ratio measures
(Supplementary Fig. 2). All the biomarkers are of known identity. The
biomarker measures include routine lipids, lipoprotein subclass profiling with lipid concentrations within 14 subclasses, fatty acid composition, and various low-molecular-weight metabolites such as amino
acids, ketone bodies and glycolysis metabolites quantified in molar
concentration units. For 14 lipoprotein subclasses, the lipid concentrations and composition are measured in terms of triglycerides,
phospholipids, total cholesterol, cholesterol esters, and free cholesterol, and total lipid concentration within each subclass. The majority
of the biomarkers are measured in absolute concentration units
(mmol/L). The 37 biomarkers in the panel which have been certified for
diagnostics use (CE-marked) are marked by asterisks in Supplementary
Fig. 2. The average biomarker detection rate was >99% across the
plasma samples.
The quality control protocol is described in Supplementary
Methods and illustrated in Supplementary Fig. 3. The distribution of
coefficients of variation of the biomarkers for UK Biobank’s blind
duplicate samples as well as Nightingale Health’s internal control
samples is shown in Supplementary Fig. 4. The coefficients of variation
for each biomarker are given in the UK Biobank data resource (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220).
This resource
also contains distribution plots showing the consistency over consecutive shipment batches and in different NMR spectrometers, as well
as scatter plots on the technical repeatability from blinded duplicate
samples and the biological consistency in repeat-visit samples drawn
from the same individuals four years apart. These technical and biological repeatability assessments are illustrated with GlycA as an
example in Supplementary Fig. 5. Supplementary Methods further
contain notes about the quality flags for samples and biomarkers as
well as general recommendations for data processing in relation to
epidemiological analyses.
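As a brief illustration of the repeatability metric referred to above, the coefficient of variation (CV) of a biomarker can be computed from repeated measurements of the same control sample as the standard deviation divided by the mean. The sketch below is a minimal example under that textbook definition; the measurement values and the function name are hypothetical and are not taken from the UK Biobank resource.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical repeated measurements of one biomarker in a control sample (mmol/L).
control_measurements = [5.02, 4.98, 5.05, 4.95, 5.00]
cv_percent = coefficient_of_variation(control_measurements)
print(f"CV = {cv_percent:.2f}%")
```

A low CV across control samples indicates that technical variation is small relative to the measured concentration.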
Plasma sample dilution issue. All UK Biobank blood samples are
known to suffer from unintended dilution during the initial sample
storage process at UK Biobank facilities. Prior reports have suggested
that samples from aliquot 3, used for the NMR measurements, suffer
from 5-10% dilution4. The dilution is believed to come from mixing of
participant samples with water due to seals that failed to hold a system
vacuum in the automated liquid handling systems. While this issue is
likely to have an impact on some of the absolute biomarker concentration values, it is expected to have limited impact on most epidemiological analyses. However, we recommend that this aspect is
considered when conducting analyses that rely on absolute concentrations, such as stratification based on biomarker concentration
cutpoints. This may also make it difficult to compare distributions of
biomarker concentrations with those observed in other cohort studies.
We, therefore, caution against using the concentrations observed in
UK Biobank as reference levels for translational applications.
Comparison to clinical chemistry. The consistency between lipids,
apolipoproteins, creatinine, albumin and glucose measured by routine
clinical chemistry and Nightingale Health NMR is illustrated in Supplementary Fig. 6. For these comparisons, it is important to note that
the clinical chemistry in UK Biobank was measured from serum samples, primarily from aliquot 1, while the NMR biomarkers were measured from EDTA plasma samples from aliquot 3. The different aliquots
are affected by different degrees of dilution, with aliquot 3 being
5–10% diluted while aliquot 1 has almost no dilution4. Supplementary
Fig. 6 therefore also shows the measurement consistency in the FinHealth 2017 study, without the dilution issue. This study is a
population-based cohort under the Finnish Institute for Health and
Welfare (THL) Biobank with n ~ 6000. In the FinHealth 2017 cohort,
clinical chemistry assays were measured from frozen serum samples
soon after the cohort survey and the NMR biomarkers one year later
from frozen samples using the Nightingale Health platform on 350 μL
aliquots of serum.
Correlations between the clinical chemistry assays and NMR were
high in both cohorts, but the overall consistency was weaker in UK
Biobank compared with the FinHealth 2017 study. In particular, the
absolute concentrations deviated more from the diagonal in UK
Biobank than in the FinHealth 2017 study, owing to the sample
dilution issue in UK Biobank. Other aspects contributing to the mismatch
in absolute concentrations in UK Biobank are subtle differences in
biomarker levels between serum and EDTA plasma and larger differences in sample storage time. The consistency of the NMR biomarkers
with clinical chemistry in the FinHealth 2017 study is in line with earlier
studies that have reported correlation coefficients R > 0.92. A recent
paper reported correlations of the same NMR biomarkers with clinical
chemistry for FINRISK cohorts under the THL Biobank to be R~0.95 for
the newest sample collection, and R~0.90 for the oldest sample
collections20. Note that ‘Clinical LDL cholesterol’ is the NMR-based
measure that provides concentrations consistent with clinical chemistry and the Friedewald equation for LDL-cholesterol. We further note
that the correlation coefficient for albumin was weaker in the UK
Biobank than observed for the other clinical chemistry measures.
However, the associations of albumin with disease outcomes were
broadly similar for both assays, as shown in Fig. 5.
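For readers unfamiliar with the Friedewald equation mentioned above, the following sketch shows the standard form of the estimate in mmol/L units: LDL cholesterol is approximated as total cholesterol minus HDL cholesterol minus triglycerides divided by 2.2. The function name, the example values and the 4.5 mmol/L guard are illustrative assumptions based on the commonly cited form of the equation, not details taken from this study.

```python
def friedewald_ldl_mmol(total_chol, hdl_chol, triglycerides):
    """Estimate LDL cholesterol (mmol/L) with the Friedewald equation.

    LDL-C = total cholesterol - HDL-C - triglycerides / 2.2 (all in mmol/L).
    The estimate is generally not considered reliable at high triglyceride
    levels, so values above 4.5 mmol/L are rejected here.
    """
    if triglycerides > 4.5:
        raise ValueError("Friedewald estimate unreliable for triglycerides > 4.5 mmol/L")
    return total_chol - hdl_chol - triglycerides / 2.2


# Hypothetical lipid panel in mmol/L.
ldl = friedewald_ldl_mmol(total_chol=5.5, hdl_chol=1.4, triglycerides=1.1)
print(f"Estimated LDL-C: {ldl:.2f} mmol/L")
```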
Comparisons of the NMR biomarkers with overlapping biomarkers from commercial mass-spectrometry assays and gas chromatography fatty acid assays in smaller cohorts are described
in Supplementary Methods and scatter plots of the consistency illustrated in Supplementary Figs. 7 and 8.
Disease outcome definitions
Prevalent, incident and mortality disease outcomes were derived from
UK Hospital Episode Statistics data and national death registries. A
diagnosis in hospital or death record formed the basis of the disease
endpoint definition. Primary care records were not used. Disease
endpoints were defined based on the first occurrence of a 3-character
ICD-10 code using the hospital inpatient and death register data (January 2021 update). To extend the follow-up time to before the introduction
of ICD-10 in 1995, ICD-9 codes were mapped to the corresponding
3-character ICD-10 codes using general equivalence mappings from
the Centers for Disease Control and Prevention (https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2018/).
A prevalent event was defined as an event that occurred before
the date of the participant’s baseline visit when a blood sample was collected. Individuals with a corresponding prevalent event for each outcome were excluded from the analysis of incident disease, but not from
analyses of mortality outcomes. The occurrence of both primary and
secondary diagnosis codes was considered to form the endpoints. The
follow-up of hospitalisations ended on November 30, 2020 in England,
October 31, 2020 in Scotland, and February 28, 2018 in Wales. The
follow-up of death registry ended on November 30, 2020. We omitted
disease outcomes with fewer than 50 cases from the analyses. This led
to a total of 648 prevalent, 717 incident and 77 mortality outcomes for
the study population with NMR biomarker data available.
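The endpoint logic described above can be sketched as follows: diagnosis codes are truncated to their 3-character ICD-10 category, the first occurrence is located, and the event is labelled prevalent or incident relative to the baseline visit. This is a minimal sketch rather than the pipeline used in the study; the record layout, function names and the choice to treat an event on the baseline date as incident are assumptions made for illustration.

```python
from datetime import date

MIN_CASES = 50  # outcomes with fewer cases were omitted from the analyses


def icd10_three_char(code):
    """Truncate an ICD-10 code such as 'I21.4' or 'I214' to its 3-character category."""
    return code.replace(".", "")[:3].upper()


def classify_first_event(baseline, diagnoses, target):
    """Classify one participant for one 3-character endpoint.

    diagnoses: list of (date, icd10_code) tuples from hospital or death records,
    covering both primary and secondary diagnosis codes.
    Returns (status, first_event_date) with status in
    {'prevalent', 'incident', 'none'}.
    """
    dates = [d for d, code in diagnoses if icd10_three_char(code) == target]
    if not dates:
        return "none", None
    first = min(dates)
    # Events strictly before the baseline visit count as prevalent.
    return ("prevalent" if first < baseline else "incident"), first


def keep_endpoint(n_cases):
    """Apply the minimum case-count criterion for including an endpoint."""
    return n_cases >= MIN_CASES


# Hypothetical participant records.
baseline_visit = date(2008, 5, 20)
records = [
    (date(2005, 3, 1), "J44.9"),
    (date(2015, 6, 10), "I214"),
    (date(2018, 1, 5), "I21.9"),
]
status_mi = classify_first_event(baseline_visit, records, "I21")
status_copd = classify_first_event(baseline_visit, records, "J44")
```

In this example the participant would be excluded from the incident analysis of J44 (prevalent at baseline) but would contribute an incident I21 event dated at the first of the two matching records.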
For the examples highlighted in this paper, we focused on 556
incident disease outcomes from ICD-10 chapters A-N. The selection of
chapters A-N excludes pregnancy-related outcomes, conditions originating in the perinatal period and congenital malformations, deformations and chromosomal abnormalities (chapters O-Q) as there were
not enough incident events passing the criterion of at least 50 events to be
included in the analyses. Chapters R-U (symptoms, signs and laboratory findings not elsewhere classified, injuries, accidents and factors
influencing health status and contact with health care services and
codes for special purposes) were excluded to place the focus on
common diseases.
Biomarker association analyses across all endpoints
For the disease association analyses, biomarker values outside four
interquartile ranges from the median were considered outliers and
excluded from the analyses. Furthermore, biomarker values were
corrected for the NMR spectrometer used for the measurements by
fitting a linear regression model with log1p-transformed concentrations as the outcome and spectrometer as the predictor. Scaled residuals from this regression were used as predictors in the association
analyses. Log1p stands for the natural logarithm of 1 + x.
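The preprocessing steps above can be sketched in a few lines. The example assumes that outliers are flagged on the raw concentrations before transformation, and it uses the fact that a linear regression on a single categorical predictor (the spectrometer) has fitted values equal to the group means, so the residual is the log1p value minus the mean of its spectrometer group. Function names and data are hypothetical; this is an illustration of the described approach, not the code used in the study.

```python
import math
from statistics import mean, median, quantiles, stdev


def remove_outliers(values, n_iqr=4.0):
    """Replace values more than n_iqr interquartile ranges from the median with None."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    med = median(values)
    return [v if abs(v - med) <= n_iqr * iqr else None for v in values]


def spectrometer_adjusted(values, spectrometers):
    """Return SD-scaled residuals of log1p(value) regressed on spectrometer.

    With one categorical predictor, the least-squares fitted value of each
    sample is its spectrometer group mean, so no matrix algebra is needed.
    Missing values (None) are passed through unchanged.
    """
    logged = {i: math.log1p(v) for i, v in enumerate(values) if v is not None}
    groups = {}
    for i, lv in logged.items():
        groups.setdefault(spectrometers[i], []).append(lv)
    group_mean = {s: mean(lvs) for s, lvs in groups.items()}
    resid = {i: lv - group_mean[spectrometers[i]] for i, lv in logged.items()}
    sd = stdev(list(resid.values()))
    return [resid[i] / sd if i in resid else None for i in range(len(values))]


# Hypothetical concentrations measured on two spectrometers, with one gross outlier.
raw = [1.0, 1.1, 0.9, 1.2, 1.0, 50.0, 1.3, 1.4, 1.2, 1.5]
machine = ["A"] * 5 + ["B"] * 5
cleaned = remove_outliers(raw)
adjusted = spectrometer_adjusted(cleaned, machine)
```

The resulting values have zero mean within each spectrometer and unit standard deviation overall, so a hazard ratio estimated on them is expressed per SD increment.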
We used Cox proportional hazard modelling to estimate associations between biomarkers and incident disease outcomes (hospitalisation or death) across all endpoints with 50 or more events. The
models were adjusted for sex and UK biobank assessment centre,
using age as the time scale of the Cox proportional hazards regression.
Associations for each biomarker-disease pair were computed separately. For biomarker association testing with prevalent diseases, we
used logistic regression models adjusted for age, sex and assessment
centre. Hazard ratios and odds ratios are reported per SD increment in
the log1p-transformed biomarker concentrations in order to allow
comparison of association magnitudes for measures with different
units and concentration range. Sex-specific analyses were conducted
for 148 female-specific and 18 male-specific diseases (Supplementary
Table 3). These association analyses were performed in a subset containing only the specific sex, using the same approach without the
inclusion of sex as a covariate. We also performed analyses by stratifying the UK biobank population into age tertiles (1st tertile 39-53 years
of age, statin use 6%; 2nd tertile 54-61 years of age, statin use 17%; 3rd
tertile 62-71 years of age, statin use 30%).
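To make concrete what using age as the time scale implies, the sketch below fits a Cox model with a single SD-scaled covariate by Newton-Raphson on the partial likelihood. Each participant enters the risk set at their age at baseline (left truncation) and leaves it at their age at event or censoring. This is a deliberately simplified pure-Python illustration under stated assumptions: it omits the adjustment for sex and assessment centre, handles ties in the Breslow manner, and uses hypothetical data and names. In practice a dedicated survival analysis package would be used.

```python
import math


def cox_single_covariate(entry_age, exit_age, event, x, n_iter=50, tol=1e-9):
    """Fit log(HR) for one covariate with age as the time scale.

    A subject j is at risk at age t if entry_age[j] < t <= exit_age[j].
    Returns (beta, se) where beta is the log hazard ratio per unit of x.
    """
    n = len(x)
    event_idx = [i for i in range(n) if event[i]]
    # Risk sets depend only on ages, so they are computed once.
    risk_sets = [
        [j for j in range(n) if entry_age[j] < exit_age[i] <= exit_age[j]]
        for i in event_idx
    ]
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        grad, info = 0.0, 0.0
        for i, rs in zip(event_idx, risk_sets):
            w = [math.exp(beta * x[j]) for j in rs]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, rs))
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, rs))
            risk_mean = s1 / s0
            grad += x[i] - risk_mean            # score contribution
            info += s2 / s0 - risk_mean ** 2    # information contribution
        if info <= 0:
            break
        step = max(-1.0, min(1.0, grad / info))  # capped Newton step
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info) if info > 0 else math.inf
    return beta, se


# Hypothetical cohort: age at baseline, age at event/censoring, event flag,
# and an SD-scaled biomarker value.
entry = [40, 42, 45, 50, 41, 47, 52, 44, 48, 43]
exit_ = [55, 60, 58, 62, 65, 59, 66, 57, 63, 61]
event = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
biomarker = [1.2, -0.5, 0.8, 0.3, -1.1, 0.9, -0.7, 1.5, -0.2, -0.4]

beta, se = cox_single_covariate(entry, exit_, event, biomarker)
hazard_ratio = math.exp(beta)
ci_low, ci_high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
```

Because age defines the risk sets, age is accounted for non-parametrically through the baseline hazard rather than as a regression covariate.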
In the biomarker-disease atlas, results are reported for all conducted analyses and the webtool allows filtering by a desired significance level. In this paper, we use a multiple testing-corrected
significance level of 5 × 10⁻⁵ for reporting statistically significant associations, i.e. correcting for 1000 independent tests to account for both
high correlation between the NMR biomarkers (~50 independent
tests7) and correlations between the disease endpoints analysed.
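The stated threshold is consistent with a Bonferroni-style correction. Assuming the conventional nominal significance level of 0.05 (implied by the threshold and the number of tests rather than stated explicitly), the arithmetic is:

```latex
\alpha_{\text{corrected}} = \frac{\alpha_{\text{nominal}}}{\text{number of independent tests}}
                          = \frac{0.05}{1000} = 5 \times 10^{-5}
```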
Clustering analyses
For clustering analyses, a dendrogram and heatmap were computed
based on the association magnitudes of the 37 biomarkers with three
diseases from each ICD-10 chapter from A to N. The diseases were
selected based on the highest number of significant biomarker associations in each ICD-10 chapter. The 37 biomarkers selected are the
ones clinically validated in the Nightingale Health NMR platform.
Biomarkers are clustered in the dendrogram based on disease association profiles, and diseases are clustered based on biomarker profiles, using complete linkage clustering based on linear correlation
between the association signatures.
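The clustering step can be illustrated with a small pure-Python sketch of complete linkage on a correlation-based distance. The distance is assumed here to be one minus the Pearson correlation, the marker names and association values are hypothetical, and the original analysis was performed in R, so this is not the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson (linear) correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def complete_linkage(profiles):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    names = list(profiles)
    # Assumed distance: 1 - correlation between association signatures
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance
                d = max(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical association magnitudes of four markers across four diseases
profiles = {
    "marker_a": [0.10, 0.20, 0.30, 0.05],
    "marker_b": [0.12, 0.23, 0.34, 0.065],   # linear in marker_a, so correlation ~1
    "marker_c": [0.30, 0.10, 0.05, 0.40],
    "marker_d": [-0.30, -0.12, -0.04, -0.35],
}
merges = complete_linkage(profiles)
```

The order of the merges corresponds to the branching of a dendrogram: markers with near-identical disease association profiles join first, and markers with opposite profiles join last.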
Fourteen disease endpoints were used for replication analyses in
THL Biobank, selected based on the outcome data made available to
Nightingale Health Plc. The disease outcome definitions were predefined by THL Biobank based on a combination of national hospital
and cause-of-death registries (Supplementary Table 2). The registry-based follow-up covers virtually all diseases leading to hospitalisation or
death in Finland. Follow-up data for the present study were available until the end
of 2016. For the replication analyses, we defined similar endpoints in UK
Biobank based on the ICD-10 codes listed in Supplementary Table 2.
The association analyses were for incident disease, so individuals
with prevalent disease of the same endpoint were omitted. The hazard
ratios were computed separately in each cohort using Cox proportional hazards regression adjusted for sex and using age as the time
scale of the regression. Results from the individual cohorts were meta-analysed using inverse variance weighting. Similar to the analyses in
UK Biobank, hazard ratios are reported in SD-scaled units.
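Inverse variance weighting combines the cohort-specific log hazard ratios, weighting each by the reciprocal of its squared standard error. The sketch below shows the standard fixed-effect formula; the function name and the two example estimates are hypothetical and do not come from the study.

```python
import math


def inverse_variance_meta(log_hazard_ratios, standard_errors):
    """Fixed-effect pooled log hazard ratio and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_hazard_ratios)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical per-SD hazard ratios from two cohorts (illustrative only)
pooled, pooled_se = inverse_variance_meta(
    [math.log(1.30), math.log(1.20)],  # log hazard ratios
    [0.05, 0.10],                      # standard errors
)
hazard_ratio = math.exp(pooled)
ci_95 = (math.exp(pooled - 1.96 * pooled_se), math.exp(pooled + 1.96 * pooled_se))
```

The more precisely estimated cohort dominates the pooled estimate, and the pooled standard error is smaller than that of either cohort alone.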
Reporting summary
Further information on research design is available in the Nature
Portfolio Reporting Summary linked to this article.
Data availability
The Nightingale Health NMR biomarker data were released to
the UK Biobank resource in spring 2021 (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220). The UK Biobank data are available
for approved researchers through the UK Biobank data-access protocol. NMR spectral data are not available as they are outside of the
scope of the Nightingale-UK Biobank initiative. Instructions for the
data access process, timeframe and restrictions imposed on the data
are described at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. The average time from application
submission to data release is 15 weeks for UK Biobank. Nightingale
Health NMR biomarker data from FINRISK and Health 2000 cohorts,
used for replication in this study, are available for approved
researchers through THL Biobank. Instructions for the data access
process are provided at https://thl.fi/en/web/thl-biobank/for-researchers/application-process. We provide access to all
biomarker-disease summary statistics for non-commercial use
through an interactive webtool at https://nightingalehealth.com/atlas
(CC BY-NC-ND 4.0 license). Source data are provided with this paper.
Code availability
Code used in this study is available at https://github.com/NightingaleHealth/ukb-nightingale-biomarker-atlas. Analyses were
performed using R (completed and tested with version 4.1.1).
Replication in additional cohorts
To replicate biomarker associations from the UK Biobank, we used
data from five prospective population-based studies administered
under the Finnish Institute for Health and Welfare (THL) Biobank:
FINRISK 1997, FINRISK 2002, FINRISK 2007, FINRISK 2012 and Health
2000. Each cohort is an independent random sample drawn from
people aged 25–98 (25–74 in FINRISK, 30 and over in Health 2000) in
the Finnish population. Baseline characteristics of these cohorts are
provided in Supplementary Table 1. The study participants are unique
in each cohort. Baseline blood samples were collected for ~85% of all
participants enrolled. Venous blood was drawn non-fasting, but with a
recommended minimum fast of 4 h. Biomarker profiling by the
Nightingale Health NMR platform was conducted from frozen serum
samples for all participants during 2018 (ref. 20). The cohort studies were
approved by the Coordinating Ethical Committee of the Helsinki and
Uusimaa Hospital District, Finland. Written informed consent was
obtained from all participants.
References
1. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and
old age. PLOS Med. 12, e1001779 (2015).
2. Bycroft, C. et al. The UK Biobank resource with deep phenotyping
and genomic data. Nature 562, 203–209 (2018).
3. Szustakowski, J. D. et al. Advancing human genetics research and
drug discovery through exome sequencing of the UK Biobank. Nat.
Genet. 53, 942–948 (2021).
4. Allen, N. E. et al. Approaches to minimising the epidemiological
impact of sources of systematic and random variation that may
affect biochemistry assay data in UK Biobank. Wellcome Open Res.
5, 222 (2021).
5. Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
6. Khera, A. V. et al. Genome-wide polygenic scores for common
diseases identify individuals with risk equivalent to monogenic
mutations. Nat. Genet. 50, 1219–1224 (2018).
7. Würtz, P. et al. Quantitative serum nuclear magnetic resonance
metabolomics in large-scale epidemiology: a primer on -omic
technologies. Am. J. Epidemiol. 186, 1084–1096 (2017).
8. Pietzner, M. et al. Plasma metabolites to profile pathways in noncommunicable disease multimorbidity. Nat. Med. 27,
471–479 (2021).
9. Julkunen, H., Cichońska, A., Slagboom, P. E. & Würtz, P. Nightingale
Health UK Biobank Initiative. Metabolic biomarker profiling for
identification of susceptibility to severe pneumonia and COVID-19
in the general population. eLife 10, e63033 (2021).
10. Smith, C. J. et al. Integrative analysis of metabolite GWAS illuminates the molecular basis of pleiotropy and genetic correlation.
eLife 11, e79348 (2022).
11. Borges, M. C. et al. Role of circulating polyunsaturated fatty acids
on cardiovascular diseases risk: analysis using Mendelian randomization and fatty acid genetic association data from over 114,000
UK Biobank participants. BMC Med. 20, 210 (2022).
12. Liu, J. et al. Longitudinal analysis of UK Biobank participants suggests age and APOE-dependent alterations of energy metabolism in
development of dementia. medRxiv https://doi.org/10.1101/2022.02.25.22271530 (2022).
13. Bragg, F. et al. Predictive value of circulating NMR metabolic biomarkers for type 2 diabetes risk in the UK Biobank study. BMC Med.
20, 159 (2022).
14. Richardson, T. G. et al. Characterising metabolomic signatures of
lipid-modifying therapies through drug target mendelian randomisation. PLOS Biol. 20, e3001547 (2022).
15. Nag, A. et al. Assessing the contribution of rare-to-common protein-coding variants to circulating metabolic biomarker levels via
412,394 UK Biobank exome sequences. medRxiv https://doi.org/10.1101/2021.12.24.21268381 (2021).
16. Bell, J. A. et al. Effects of general and central adiposity on circulating
lipoprotein, lipid, and metabolite levels in UK Biobank: a multivariable Mendelian randomization study. Lancet Reg. Health - Eur.
21, 100457 (2022).
17. Fang, S., Holmes, M. V., Gaunt, T. R., Smith, G. D. & Richardson, T. G.
Constructing an atlas of associations between polygenic scores
from across the human phenome and circulating metabolic biomarkers. eLife 11, e73951 (2022).
18. Ritchie, S. C. et al. Quality control and removal of technical variation
of NMR metabolic biomarker data in ~120,000 UK Biobank participants. Sci. Data 10, 64 (2023).
19. Soininen, P., Kangas, A. J., Würtz, P., Suna, T. & Ala-Korpela, M.
Quantitative serum nuclear magnetic resonance metabolomics in
cardiovascular epidemiology and genetics. Circ. Cardiovasc.
Genet. 8, 192–206 (2015).
20. Tikkanen, E. et al. Metabolic biomarker discovery for risk of peripheral artery disease compared with coronary artery disease:
lipoprotein and metabolite profiling of 31,657 individuals from 5
prospective cohorts. J. Am. Heart Assoc. 10, e021995 (2021).
21. Wittemans, L. B. L. et al. Assessing the causal association of glycine
with risk of cardio-metabolic diseases. Nat. Commun. 10,
1060 (2019).
22. Holmes, M. V. et al. Lipids, lipoproteins, and metabolites and risk of
myocardial infarction and stroke. J. Am. Coll. Cardiol. 71,
620–632 (2018).
23. Würtz, P. et al. Metabolomic profiling of statin use and genetic
inhibition of HMG-CoA reductase. J. Am. Coll. Cardiol. 67,
1200–1210 (2016).
24. Sliz, E. et al. Metabolomic consequences of genetic inhibition of
PCSK9 compared with statin treatment. Circulation 138,
2499–2512 (2018).
25. Deelen, J. et al. A metabolic profile of all-cause mortality risk
identified in an observational study of 44,168 individuals. Nat.
Commun. 10, 3346 (2019).
26. Ritchie, S. C. et al. The biomarker GlycA is associated with chronic
inflammation and predicts long-term risk of severe infection. Cell
Syst. 1, 293–301 (2015).
27. Kettunen, J. et al. Biomarker glycoprotein acetyls is associated with
the risk of a wide spectrum of incident diseases and stratifies
mortality risk in angiography patients. Circ. Genom. Precis. Med.
11, e002234 (2018).
28. Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).
29. Tynkkynen, J. et al. Association of branched‐chain amino acids and
other circulating metabolites with risk of incident dementia and
Alzheimer’s disease: a prospective study in eight cohorts. Alzheimers Dement. 14, 723–733 (2018).
30. Mayers, J. R. et al. Elevation of circulating branched-chain amino
acids is an early event in human pancreatic adenocarcinoma
development. Nat. Med. 20, 1193–1198 (2014).
31. Elliott, P. & Peakman, T. C. on behalf of UK Biobank. The UK Biobank
sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int. J. Epidemiol.
37, 234–244 (2008).
Acknowledgements
The authors are grateful to UK Biobank (Project #30418) and THL Biobank
(project #BB2016_86) for access to data to undertake this study. The
authors thank all biobank participants for their generous contribution to
generating this resource for the scientific community. The work was
funded by Nightingale Health Plc.
Author contributions
H.J., A.C., A.J.K., P.S., J.B., P.W. designed research; H.J. and A.C. contributed to statistical analyses and interpretation of results; M.T., H.K.,
K.N., V.M., J.N.-K., P.S., A.J.K. contributed to biomarker measurements
and quality control; M.P., V.S., P.J., A.L., and K.K. contributed data or
results for replication; H.J., A.C., J.B., and P.W. contributed to the interpretation of results and wrote the manuscript. All authors reviewed the
manuscript.
Competing interests
H.J., M.T., H.K., K.N., V.M., J.N.-K., A.J.K., P.S., J.B., and P.W. are
employees of Nightingale Health Plc, and hold shares or stock options in
Nightingale Health Plc. A.C. is a former employee of Nightingale Health
Plc. V.S. has received an honorarium for consulting from Sanofi and has
ongoing research collaboration with Bayer Ltd outside this work. The
remaining authors declare no competing interests.
Additional information
Supplementary information The online version contains
supplementary material available at
https://doi.org/10.1038/s41467-023-36231-7.
Correspondence and requests for materials should be addressed to
Heli Julkunen or Peter Würtz.
Peer review information Nature Communications thanks Tom Richardson, Timothy Ebbels, and the other, anonymous, reviewer(s) for their
contribution to the peer review of this work. Peer reviewer reports are
available.
Reprints and permissions information is available at
http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons license, and indicate if
changes were made. The images or other third party material in this
article are included in the article’s Creative Commons license, unless
indicated otherwise in a credit line to the material. If material is not
included in the article’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright
holder. To view a copy of this license, visit http://creativecommons.org/
licenses/by/4.0/.
© The Author(s) 2023
Nature Communications | (2023)14:604
Publication IV
Nightingale Health Biobank Collaborative Group. Metabolomic and genomic
prediction of common diseases in 700,217 participants in three national
biobanks. Nature Communications, November 2024.
© 2024 The Author(s). This is an open access article distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License
(http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any non-commercial
use, sharing, distribution and reproduction in any medium or format, as long as
appropriate credit is given to the original author(s).
Article
https://doi.org/10.1038/s41467-024-54357-0
Metabolomic and genomic prediction of
common diseases in 700,217 participants in
three national biobanks
Received: 12 October 2023
Nightingale Health Biobank Collaborative Group*
Accepted: 8 November 2024
Identifying individuals at high risk of chronic diseases via easily measured
biomarkers could enhance efforts to prevent avoidable illness and death.
Using ’omic data can stratify risk for many diseases simultaneously from a
single measurement that captures multiple molecular predictors of risk. Here
we present nuclear magnetic resonance metabolomics in blood samples from
700,217 participants in three national biobanks. We built metabolomic scores
that identify high-risk groups for diseases that cause the most morbidity in
high-income countries and show consistent cross-biobank replication of the
relative risk of disease for these groups. We show that these metabolomic
scores are more strongly associated with disease onset than polygenic scores
for most of these diseases. In a subset of 18,709 individuals with metabolomic
biomarkers measured at two time points we show that people whose scores
change have different risk of disease, suggesting that repeat measurements
capture changes both to health status and disease risk possibly due to treatment, lifestyle changes or other factors. Lastly, we assessed the incremental
predictive value of metabolomic scores over existing clinical risk scores for
multiple diseases and found modest improvements in discrimination for several diseases whose clinical utility, while promising, remains to be determined.
Identifying individuals at elevated risk of disease can help guide the
use of preventative interventions. For example, in the UK the multivariable QRISK score is used to identify individuals at high risk of
cardiovascular disease who should adjust their lifestyle or begin taking
cholesterol-lowering or blood pressure-reducing medicine1. This concept of combining multiple measurements or risk factors into a single
score has been extended to the use of ‘omic data, such as polygenic
scores (PGS)2,3. By adding up the contribution of different genetic
variants associated with different diseases, PGS can identify individuals
at elevated risk for multiple diseases4 with one measurement (e.g., a
GWAS array or genome sequence), and offer complementary information to traditional risk factors5,6. Metabolomic scores, based on
adding up the contributions of multiple biomarkers measured from a
blood sample, for example via nuclear magnetic resonance
spectroscopy7–9, have also been shown to predict many common
diseases10,11 including cardiovascular disease, type 2 diabetes12, and all-cause mortality13. Furthermore, since the metabolomic scores may
change in response to lifestyle and treatment (in contrast to PGS), they
can also track changes in people’s risk profiles. A few studies have
suggested complementary value for genetics and metabolomics in
cardiovascular disease and type 2 diabetes14,15, but the combined use of
these ‘omics-based risk predictors has not yet been evaluated at scale.
Here, we generated nuclear magnetic resonance metabolomic
biomarker data in blood samples from apparently healthy individuals
from three national biobanks with follow-up data on clinical outcomes.
We trained risk prediction scores for the 12 leading causes of disability-adjusted life years (DALYs) in high-income countries. We investigated
the relative performance of these metabolomic scores, PGS and clinical scores in different diseases and time scales. We replicated the
performance of metabolomic scores across the three biobanks in the
study and assessed the value of multiple metabolomic time points in
two of the biobanks.
*A list of authors and their affiliations appears at the end of the paper.
Nature Communications | (2024)15:10092
Results
Metabolomic risk prediction across top sources of morbidity
Building metabolomic scores. We measured metabolomic biomarkers via nuclear magnetic resonance spectroscopy in blood samples provided at the time of enrollment from 700,217 participants in
the UK Biobank, Estonian Biobank (EBB), or Finnish THL Biobank, all
with linked comprehensive clinical data (Table 1, Supplementary
Data 1). An overview of the study design is shown in Fig. S1. All three
biobanks contain adults from Northern European countries, with
varying ascertainment, recruitment years, age ranges, and procedures
for extracting outcomes from electronic health records (Methods,
Fig. S2).
We analyzed 12 diseases causing the most morbidity in the WHO
European region in 2019 (excluding falls and back pain, Fig. 1, Table 1,
Supplementary Data 2), which cause more than one-third of all DALYs.
We trained Cox proportional hazards models to predict incidence of
each of these diseases in half of the UK Biobank. We included age and
sex in all models as fixed covariates and allowed the model to select
(via Lasso with tenfold cross-validation) from among 36 metabolomic
biomarkers that have been validated in Europe for use in an in vitro
diagnostic medical device (Methods). For all but two of the diseases
studied, more than half of the biomarkers were included in the scores:
17 for Alzheimer’s disease, 18 for intracerebral hemorrhage, 21 for
colon cancer, 24 for lung cancer, 26 for vascular and other dementias,
27 for stroke, 28 for alcoholic liver disease, 29 for chronic obstructive
pulmonary disease (COPD), 30 for liver cirrhosis, 31 for myocardial
infarction, 33 for diabetes, and 35 for depression (model coefficients
for each biomarker in each score are shown in Supplementary Data 3
and Fig. S3). We evaluated the performance of these scores in the other
half of the UK Biobank, as well as the Estonian and Finnish THL biobanks. As we quantify the biomarkers in absolute concentration units
(e.g., mmol/l), we can directly use the variable coefficients estimated in
the UK Biobank to calculate scores in the other two datasets, without
normalizing the biomarkers within each study separately. This is distinct from common practice in other ‘omics analyses, where within-cohort
normalization is essential16,17. Figure S4 shows that we obtain
highly similar results with these normalization steps, but we here
present results without them to better mimic predicting a new individual’s risk without additional information (e.g., batch corrections, or
cohort means and variances).
Table 1 | Basic characteristics of the participants in the three national biobanks
Biobank | UK Biobank | Estonian Biobank | THL Biobank*
Number of participants | 477,078 | 190,785 | 32,354
Age at blood sample (median, [IQR]) | 58.0 [50.0–63.0] | 43.0 [31.0–56.0] | 51.0 [39.0–61.0]
Females (N (%)) | 260,253 (54.6) | 125,565 (65.8) | 17,248 (53.3)
Body mass index (kg/m2; median, [IQR]) | 26.7 [24.1–29.8] | 25.3 [22.4–29.0] | 26.2 [23.6–29.4]
Smoking prevalence (%) | 10.5 | 18.3 | 34.7
Cholesterol lowering medication (%) | 17.4 | 10.1 | 10.7
Follow-up time (median, [IQR]) | 11.8 [11.0–12.5] | 3.2 [2.9–3.7] | 13.8 [8.8–15.2]
Recruitment period | 2006–2010 | 2002–2021 | 1997–2012
Number of incident events 4 years after baseline visit:
Myocardial infarction | 2753 | 441 | 304
Ischemic stroke | 1791 | 517 | 281
Intracerebral hemorrhage | 468 | 117 | 47
Lung cancer | 1259 | 180 | 48
Type 2 diabetes | 3649 | 916 | 484
Chronic obstructive pulmonary disease | 3716 | 732 | 168
Alzheimer’s disease | 180 | 58 |
Vascular and other dementia | 326 | 175 |
Depressive disorders | 4921 | 2774 |
Alcoholic liver disease | 322 | 109 | 33
Cirrhosis of the liver | 321 | 69 | 8
Colon and rectum cancers | 1965 | 293 | 60
24
See Fig. S2 for age and recruitment year histograms. *See Supplementary Data 1 for characteristics by cohort.
Baseline age and sex minimally adjusted metabolomic scores and
incident disease. We stratified the three test sets into one percent bins
of the metabolomic score distribution and meta-analyzed the four-year incidence rates for each disease (Fig. 1A). The risk of incident
disease increased with increasing levels of the metabolomic score
across all the diseases. As has been observed previously4,10, these
curves follow a quantile-logistic function, which rises super-exponentially in the tails, making it possible to identify subsets of
individuals that are at much higher risk than average. This effect is
especially dramatic for the scores that most strongly predict disease,
including type 2 diabetes and liver diseases.
Figure 1B shows the performance of the scores by comparing the
relative risk of incident disease in the 10% of individuals with the
highest metabolomic scores (high-risk group, red shaded area, Fig. 1A)
to the remaining population. Again, to avoid needing within-cohort
scaling factors or thresholds, we used the top 10% boundary from our
training data to define this group in the other half of the UK Biobank
and the other two biobanks. This means the proportion of individuals
in the high-risk group varies across the three biobanks (Supplementary
Data 4), but this high-risk group nonetheless had consistently
increased risk across diseases (Fig. 1B). Only depression, alcoholic liver
disease, lung cancer, and COPD showed significant meta-analysis heterogeneity (Cochran’s Q, p < 0.004 to account for multiple testing).
The UK Biobank test set had the highest point estimate of effect size in
only 4 of 12 diseases, demonstrating that the scores are capturing
generalizable risk factors, rather than overfitting to the UK Biobank.
The meta-analysis of the three test sets included hazard ratios of ~10
for two types of liver disease and diabetes, ~4 for COPD and lung
cancer, and ~2.5 for myocardial infarction, stroke and vascular
dementia, and was statistically significant (fixed-effect meta-analysis Z
score test, p < 0.004 adjusted for multiple testing) for all diseases
(Fig. 1B). The pattern of association is similar when considering hazard
ratios per standard deviation in a continuous model (Fig. S4), and
population-wide discrimination, as measured by area under the
receiver-operating characteristic curve (AUC), shows consistent,
though variable, improvement when adding metabolomic scores to
age and sex (Supplementary Data 5).
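The idea of carrying both the coefficients and the high-risk boundary over from the training data, without any within-cohort rescaling, can be sketched as follows. Everything named here is hypothetical: the two biomarkers, their coefficients, the scores, and the order-statistic rule used for the 90th percentile are assumptions for illustration, and the published models additionally include age and sex.

```python
def metabolomic_score(concentrations, coefficients):
    """Linear predictor: sum of coefficient x concentration in absolute units."""
    return sum(coefficients[name] * concentrations[name] for name in coefficients)


def top_decile_cutoff(training_scores):
    """Score at the 90th percentile of the training data (order-statistic rule)."""
    ordered = sorted(training_scores)
    n = len(ordered)
    index = -(-9 * n // 10) - 1  # ceil(0.9 * n) - 1, using integer arithmetic
    return ordered[index]


def high_risk_flags(scores, cutoff):
    """Flag individuals whose score exceeds the fixed training cut-off."""
    return [score > cutoff for score in scores]


# Hypothetical coefficients and concentrations (illustrative only)
coefficients = {"glucose": 0.2, "albumin": -0.05}
person = {"glucose": 5.0, "albumin": 40.0}
score = metabolomic_score(person, coefficients)

# The cut-off is learned once on the training half ...
training_scores = list(range(1, 101))
cutoff = top_decile_cutoff(training_scores)

# ... and reused unchanged in another cohort, so the share flagged can differ from 10%
other_cohort_scores = [50, 95, 120, 10]
flags = high_risk_flags(other_cohort_scores, cutoff)
share_high_risk = sum(flags) / len(flags)
```

Because the boundary is fixed rather than recomputed per cohort, a new individual can be classified from their own measurement alone, at the cost of the high-risk group having a different size in each biobank.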
Sensitivity and subgroup analyses of minimally adjusted scores.
Having demonstrated that it is possible to construct metabolomic
scores that are replicably associated with risk of these diseases, we
next sought to use the diverse data available in these biobanks to
investigate further properties of metabolomic scores.
First, we assessed the performance of scores using all 249 metabolomic biomarkers we measured, rather than the 36 clinically validated biomarkers described above. Only diabetes and COPD showed
consistently improved performance using the extended metabolomics
(Fig. S5), likely because many of the biomarkers are correlated, and our
clinically validated subset captures a large fraction of the total available information in most cases. In some cases, such as Alzheimer’s and
the liver diseases in the EBB, the extended metabolomic scores do not
replicate as well as the simpler scores. Taken together, these results
suggest some diseases may benefit from additional molecular measurements, but care must be taken that they do not capture cohort-specific effects which are less transferrable.
[Figure 1. Panel A: incidence (%) by percentile of metabolomic score, one plot for each of the 12 diseases. Panel B: hazard ratio (95% CI), highest risk decile vs. remaining population, for each of the 12 diseases, shown for the meta-analysis, UK Biobank, THL Biobank and Estonian Biobank.]
Fig. 1 | Association between metabolomic scores and disease onset in three
national biobanks. A Observed incidence of the 12 diseases divided into one percent
bins of the metabolomic score. The observed incidence is shown as a sample size
weighted mean of the 4-year incidence in the three biobank cohorts (n = 481,678). Red
shading shows the top 10% of the metabolomic score (adjusted for age and sex).
Horizontal dashed line shows the population prevalence. B Four-year hazard ratios
of metabolomic scores (adjusted for age and sex) comparing the highest 10% to
the remaining 90% of the study population for 12 diseases (n = 481,678). The dots
represent point estimates (Cox regression estimates for individual cohorts, fixed
inverse-variance weighted mean for meta-analysis) and the horizontal error bars
denote 95% confidence intervals of the hazard ratio. Source data are provided as
a Source Data file.
Second, we sought to understand the extent to which our scores
are driven by well-established behavioral risk factors for some of these
diseases, in particular tobacco smoking and alcohol consumption.
Both lung disease scores show attenuated, but still strong, association
when conditioning on pack-years of smoking, suggesting that while
they are partly driven by this behavior, they capture additional information beyond the self-reported variables (Fig. S6). The performance
of our lung cancer score is reduced to almost zero in never smokers,
whereas our COPD score still has significant prediction in that group.
Liver cirrhosis is equally well predicted across ever and never drinkers,
and virtually unaffected by an adjustment for daily alcohol units. The
adjusted alcoholic liver disease prediction is somewhat reduced but
remains very strong (Fig. S6).
Third, as we were limited to somewhat coarse ICD-10 based definitions of the diseases we were studying, we examined whether
broader or narrower definitions might change our results by investigating cardiovascular disease more closely. The score trained on the
narrow outcome of myocardial infarction had a 0.96 correlation with
one trained more broadly on ischemic heart disease. This suggests that
the derived scores are not very sensitive to the definition of the disease endpoint,
likely because the underlying risk factors are broadly shared.
When testing the scores for association in the test datasets, both show
a gradient of increasing effect size for more severe outcomes, from
unstable angina to first myocardial infarction to subsequent myocardial infarction (Fig. S7), suggesting that the scores may be strongest
at predicting severe outcomes.
[Figure 2. Panel A: hazard ratio (95% CI), highest risk decile vs. remaining population, by disease, for Metabolomics + PGS, PGS and Metabolomics. Panel B: cumulative incidence (%) by follow-up time (years), by disease, for three groups: top decile of PGS and metabolomic score; top decile of PGS, bottom 90% of metabolomic score; bottom 90% of PGS. Panel C: hazard ratio (95% CI), highest risk decile vs. remaining population, by time to event (years), by disease, for Metabolomics and PGS.]
Fig. 2 | Associations between metabolomic and polygenic scores and
disease onset. A Ten-year hazard ratios of metabolomic, polygenic or combined
scores comparing the highest risk decile to the remaining study population (UK
Biobank test set, n = 242,492). Dots represent Cox regression estimates and horizontal error bars denote 95% confidence intervals of the hazard ratio. B Risk of
disease incidence after blood sampling for the high genetic risk group stratified by
their metabolomic score and for the average genetic risk group. Shaded region denotes
95% confidence interval. C Hazard ratios for highest decile of metabolomic or
polygenic scores stratified by time to event. Dots and vertical error bars denote Cox
regression estimates and 95% confidence intervals per bin, shaded region is 95%
confidence interval for a generalized survival model allowing a time-varying effect
using natural splines with 2 knots. All scores adjusted for age and sex (n = 242,492).
Source data are provided as a Source Data file.
Comparison of hazard ratios for incident disease among metabolomic, genetic, and combined scores
We next compared the performance of metabolomic scores to PGS,
which have received widespread attention for risk stratification to aid
prevention4,18,19. We calculated PGS using variant weights from the PGS
Catalog (Supplementary Data 2), only using scores that were built from
genome-wide association study (GWAS) data that did not include the
biobanks studied here. We again trained models in half of the UK Biobank,
always including age and sex, and using Lasso to select from: (i) only
the external PGS, (ii) among the metabolomic biomarkers (as above),
or (iii) among both the PGS and the metabolomic biomarkers.
PGS were available for 10 of our diseases, and, as expected, the top
10% high-risk groups were at significantly higher risk than the
remaining 90% (Fig. 2A). However, the hazard ratio for being in the
genetic high-risk group was lower than that for the metabolomic high-risk group
in all diseases except colorectal cancer. In most cases, the best-performing model included both genetic and metabolomic scores, suggesting that these two data types capture at least partially
complementary information. A formal interaction test between metabolomic and genetic scores found a significant effect only for type 2
diabetes (Supplementary Data 6), and the narrow confidence intervals of the interaction estimates
indicated that genetic and metabolomic risk is primarily additive
on the log hazard ratio scale. For six diseases we could also calculate
PGS in the EBB, which replicated the results in the UK Biobank (Fig. S8).
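The interaction test above can be made concrete with a worked equation. This is a sketch that assumes a standard Cox proportional hazards model with a product term; the symbols $M$ (metabolomic score), $G$ (polygenic score) and the coefficients $\beta$ are introduced here for illustration and are not taken from the Methods.

```latex
h(t \mid M, G, \mathrm{age}, \mathrm{sex})
  = h_0(t)\,\exp\bigl(\beta_M M + \beta_G G + \beta_{MG}\, M G
      + \beta_{\mathrm{age}}\,\mathrm{age} + \beta_{\mathrm{sex}}\,\mathrm{sex}\bigr)

% Log hazard ratio relative to an individual with M = G = 0 and the same age and sex:
\log \mathrm{HR}(M, G) = \beta_M M + \beta_G G + \beta_{MG}\, M G
```

Under this formulation, risk that is "primarily additive on the log hazard ratio scale" corresponds to $\beta_{MG} \approx 0$, in which case the hazard ratios of the two scores simply multiply: $\mathrm{HR}(M, G) \approx e^{\beta_M M}\, e^{\beta_G G}$.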
We stratified individuals in the genetic high-risk group by whether
they were also in the metabolomic high-risk group (Fig. 2B). Individuals
at high risk by both PGS and metabolomic scores are indeed at very
elevated risk, but genetically predisposed individuals not in the high
metabolomic risk group have risk similar to (or in some cases less than)
those not in the genetically predisposed group. This is likely because
current PGS capture less than half of the genetic risk for these diseases,
and the unexplained heritability, combined with lifestyle and environmental history, is partially reflected in the metabolomic score. The
metabolomic and genetic scores also have different patterns of correlation between different diseases (Fig. S9). As has been previously
shown, the PGS for different diseases tend to be largely uncorrelated19,
whereas the different metabolomic scores are nearly all correlated
with each other, reflecting multi-morbidity10,13. Combining the two
types of information can yield both improved performance and specificity of risk stratification.
While the three biobanks are dominated by individuals of European ancestry, we did compare the transferability of the metabolomic scores and PGS for 8 endpoints with at least 35 events in multiple
ancestries in the UK Biobank (Fig. S10). The metabolomic scores
remained significantly predictive across disease-ancestry
https://doi.org/10.1038/s41467-024-54357-0
[Fig. 3 plot panels. Legend: Stayed in top decile of metabolomic score; Left top decile; Joined top decile; Everyone else. Panel titles: Myocardial infarction; Ischemic stroke; Lung cancer; Type 2 diabetes; Chronic obstructive pulmonary disease; Vascular and other dementia; Colon and rectum cancers; Depressive disorders. Axes: cumulative incidence (%); follow-up time after repeat sample (years).]
Fig. 3 | Risk of disease onset for eight diseases stratified by metabolomic
scores at two time points. Maroon lines show those who were in the high-risk
group at both time points ~5 years apart, green lines show those who were in the
high-risk group at enrollment but had left it by the second time point, orange lines
show those who were in the low risk group at baseline and moved to the high risk
group at follow-up, and black lines show those who were in the low-risk group at
both time points. Data from 18,709 UK Biobank participants with metabolomic
scores from two time points. Shaded areas are 95% confidence intervals, derived
from the standard error of the cumulative hazard. Source data are provided as a
Source Data file.
combinations, though often with weaker effect size estimates than in
the European ancestry group. As has been previously shown, the effect
sizes of PGS were also attenuated in non-European ancestries, and
because they perform worse than metabolomics in Europeans, the
estimate was statistically significant in only 3 out of 19 non-European
comparisons. For the metabolomic scores, 12 out of 19 were significant. This shows how more diverse datasets will be essential not just
to produce transferable polygenic risk scores, but more generally to
produce multi-omic scores that are as widely useful as possible.
The longer follow-up time in the UK Biobank also allowed us to
compare short-term and long-term prediction from these scores. As
expected, since PGS are fixed throughout life, their hazard ratios
remained constant over follow-up time (Fig. 2C). The relationship
between hazard ratio for the metabolomic score and time to event
varied by disease: diabetes, lung cancer, vascular dementia and alcoholic liver disease scores provide stronger stratification of near-term
risk, but for most diseases the metabolomic scores were stable over
time, like PGS.
Incident disease risk among participants with two
metabolic scores
We generated metabolomic profiles at a second time point from blood
samples donated by 18,709 UK Biobank participants who returned for
a repeat visit approximately four and a half years after they initially
enrolled in the study (median time difference 4.4 years, mean 4.3
years, range 2.1–6.9 years). The correlations of the scores between the two time points range from
0.42 for Alzheimer’s disease to 0.71 for diabetes and fall in the middle
of the range of correlations for individual biomarkers (e.g., amino acids
~0.3, HDL cholesterol ~0.8) (Supplementary Data 7, Fig. S11).
For eight diseases (myocardial infarction, ischemic stroke, diabetes, COPD, depression, colorectal cancer, lung cancer, and vascular
and other dementias) at least 100 events occurred within 10 years of
the repeat visit, so we fitted a joint risk model with baseline and follow-up metabolomic score measurements. For diabetes (Cox regression,
baseline HRb = 2.52, 95% CI = 2.24–2.83, pb = 2.45 × 10^−54, follow-up
HRf = 1.57, 95% CI = 1.39–1.77, pf = 1.10 × 10^−13) and COPD (HRb = 1.52,
95% CI = 1.33–1.72, pb = 2.2 × 10^−10, HRf = 1.31, 95% CI = 1.31–1.70,
pf = 1.12 × 10^−9) both time points were significantly associated with 10-year risk; for all other diseases except vascular and other dementias the
hazard ratio point estimates were all consistently positive but were not
significant due to weaker prediction from the scores and smaller
sample size. This suggests that both a person’s current metabolomic
score value, as well as previously measured score values, contribute
information about risk of disease onset.
To further explore this idea, we considered individuals in the top
10% high-risk groups at the first time point and compared the subset of
that group who remained in the high-risk group at the follow-up time
point to those who had left it. For diabetes, leaving the high-risk group
showed a significant reduction in risk (Cox regression, HR = 2.58, 95%
CI = 1.74–3.84, p = 2.7 × 10^−6), after adjusting for baseline score (Fig. 3).
For lung cancer and COPD, risks were reduced fivefold (HR = 4.96, 95%
CI = 1.61–15.23, p = 5.2 × 10^−3) and 1.9-fold (HR = 1.92, 95% CI = 1.21–3.06,
p = 5.8 × 10^−3) respectively, but estimates were no longer significant
after multiple testing correction. We replicated this analysis in 5038
individuals from the EBB for whom we also profiled a second time point
from blood samples donated approximately five years after the baseline survey. We observed the same effect for type 2 diabetes, which was
the only disease for which we had sufficient cases to test (HR = 4.4,
p = 0.002).
While we do not know what caused individual metabolomic
scores to change between time points in these observational cohorts,
we can assess what differences in lifestyle factors are associated with
changes in metabolomic scores. For example, obese individuals who
stayed in the high-risk group for diabetes gained an average of 0.18 units
of body mass index (BMI), but those who changed from high to low risk
lost an average of 0.81 units of BMI (difference of 0.99, 95% CI
0.78–1.20, linear regression t(df = 1348) = 9.22, p = 1.08 × 10^−19). Among
self-reported smokers who were in the high-risk group for COPD at the
first time point, 64% of those who continued smoking remained at high
risk, compared to just 40% of those who reported quitting between the
two time points (Fisher’s exact test OR = 2.72, 95% CI = 2.47–6.68,
p = 0.0055). However, these lifestyle factors explained only a few percent of the
observed metabolomic score changes, demonstrating that the scores
integrate a wider range of information than questionnaires.
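The four groups compared in Fig. 3 (stayed in, left, joined, or never in the top decile) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and it assumes the top-decile cut point is computed separately within each time point, which the text does not specify.

```python
from statistics import quantiles


def top_decile_cutoff(scores):
    """Return the 90th-percentile cut point of a list of scores."""
    # quantiles(..., n=10) returns nine cut points; the last is the 90th percentile.
    return quantiles(scores, n=10, method="inclusive")[-1]


def trajectory_group(high_at_baseline, high_at_followup):
    """Map top-decile membership at two time points to the four groups in Fig. 3."""
    if high_at_baseline and high_at_followup:
        return "stayed in top decile"
    if high_at_baseline:
        return "left top decile"
    if high_at_followup:
        return "joined top decile"
    return "everyone else"


def classify_trajectories(baseline_scores, followup_scores):
    """Label each participant by risk-group trajectory.

    Assumption: cut points are derived separately at each time point.
    """
    cut_baseline = top_decile_cutoff(baseline_scores)
    cut_followup = top_decile_cutoff(followup_scores)
    return [
        trajectory_group(b > cut_baseline, f > cut_followup)
        for b, f in zip(baseline_scores, followup_scores)
    ]
```

The resulting labels would then serve as the stratifying variable for cumulative incidence curves or as a covariate in a Cox model, alongside the baseline score.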
Comparing multi-omics to existing clinical risk scores
We next compared our multi-omics predictions to published clinical
risk scores both in terms of hazard ratios for the top decile and
population-wide AUC. For all our diseases except Alzheimer’s we
identified scores that are recommended for use either by the National
Health Service in England and Wales (NHS) or by professional bodies in
the UK, EU, or USA, and calculated them as accurately as possible using
available variables in the UK Biobank (Methods). These scores vary in
the types of variables they include (e.g., QRISK for cardiovascular
disease includes several blood and anthropometric measurements,
whereas PHQ-2 for depression is based solely on two self-reported
questions). The multi-omics scores perform significantly better for 10-year risk prediction of myocardial infarction, the two liver diseases,
and colon cancer, while the clinical scores perform significantly better
for lung cancer, diabetes, and depression, and the remaining four
either have no difference, or inconsistent results between AUC and
hazard ratio of the top decile (Fig. 4A, Table 2). For all diseases except
intracerebral hemorrhage (which is not well predicted beyond age and
sex by any score we tested) a combined clinical+multi-omic score has
significantly higher AUC than the clinical score, with increases ranging
from 0.006 for lung cancer to 0.118 for alcoholic liver disease (Table 2).
These results were also consistent for four years of follow-up (Supplementary Data 8, Fig. S12).
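To make the AUC comparison concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen case has a higher score than a randomly chosen non-case (the Mann-Whitney formulation), and the AUC gain from adding a second score is then a simple difference. The function name is hypothetical, and the p-values reported in Table 2 come from DeLong's method, which is not implemented here.

```python
def auc(scores, events):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    scores: predicted risk per participant; events: 1/True if the disease occurred.
    Ties between a case and a non-case count as half a win.
    """
    cases = [s for s, e in zip(scores, events) if e]
    controls = [s for s, e in zip(scores, events) if not e]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

The pairwise loop is quadratic in cohort size, so for biobank-scale data a rank-based implementation would be preferable; the definition is the same.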
To illustrate how the multi-omics score could augment the most
widely used risk screening tool in the UK, we further focused on
comparing QRISK to QRISK+multi-omics in individuals not using statins at baseline, which approximates the eligible group for a common
use of these scores in prioritizing patients for statin treatment. The
AUCs improve by 0.029 for myocardial infarction and 0.008 for
stroke, which are equivalent to the whole-population values in Table 2
(i.e., there is no interaction between these scores and statin usage).
Considering net reclassification index, the continuous changes (i.e.,
net fraction of individuals whose risk score moves in the right direction) are substantial: 0.21 for myocardial infarction events (NRI+) and 0.31
for MI non-events (NRI–), and 0.02 and 0.18, respectively, for stroke. Considered
categorically (i.e., moving in the right direction between high- and low-risk groups) they are NRI+ = 0.09 and NRI– = −0.01 for MI and
NRI+ = 0.02 and NRI– = 0.00 for stroke.
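The net reclassification index described above can be sketched as follows. NRI+ is the net fraction of events whose predicted risk moves up under the new score, and NRI– is the net fraction of non-events whose predicted risk moves down. The function names and the `threshold` parameter are illustrative assumptions; the paper's exact high-risk cut-off is not restated here.

```python
def direction(old, new, threshold=None):
    """Return +1 if risk moved up, -1 if it moved down, 0 otherwise.

    With threshold=None any change counts (continuous NRI). Otherwise only a
    move across the threshold between low- and high-risk groups counts
    (categorical NRI).
    """
    if threshold is None:
        return (new > old) - (new < old)
    return int(new >= threshold) - int(old >= threshold)


def net_reclassification(old_risk, new_risk, events, threshold=None):
    """Return (NRI+, NRI-) for a new risk score relative to an old one."""
    moves_events = [
        direction(o, n, threshold)
        for o, n, e in zip(old_risk, new_risk, events) if e
    ]
    moves_nonevents = [
        direction(o, n, threshold)
        for o, n, e in zip(old_risk, new_risk, events) if not e
    ]
    if not moves_events or not moves_nonevents:
        raise ValueError("need at least one event and one non-event")
    # Events should move up; non-events should move down.
    nri_plus = sum(moves_events) / len(moves_events)
    nri_minus = -sum(moves_nonevents) / len(moves_nonevents)
    return nri_plus, nri_minus
```

Reporting the two components separately, as the text does, shows whether an improvement comes from better identification of future cases or from reassuring non-cases.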
Finally, we compared our high-risk groups to the remainder of the
population using the Frailty Index20,21 as a surrogate for an overall
impression of the health of an individual. The high-risk group has
slightly higher frailty index values (Fig. S13) and generally different
clinical characteristics (Supplementary Data 9).
Calibration across studies from different countries
Risk prediction models should be evaluated based on both discrimination and calibration22. We therefore tested the calibration of
our metabolomic scores by plotting observed event rates against
predicted absolute event rates per decile in all three biobanks (Fig. 4B).
We estimated calibration slopes and intercepts by fitting a logistic
regression model of observed risk on predicted risk, without any
study-specific processing or normalization, mimicking real-world
patient usage (Methods). For the main calibration analysis, we included diseases with >200 events over 3 years, as recommended by
earlier studies23,24. Calibration results for the remaining diseases are
shown in Fig. S14.
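The decile-based calibration plot described above reduces to a small grouping step: sort participants by predicted absolute risk, split them into ten equally sized bins, and compare the mean predicted risk with the observed event rate in each bin. The sketch below is illustrative only (the function name is hypothetical) and does not estimate the calibration slope or intercept, which the text obtains from a logistic regression.

```python
def calibration_by_decile(predicted_risk, observed_event, n_bins=10):
    """Return (mean predicted risk, observed event rate) for each risk bin.

    predicted_risk: absolute risk per participant (0 to 1).
    observed_event: 1/True if the event occurred within the follow-up window.
    """
    pairs = sorted(zip(predicted_risk, observed_event))
    n = len(pairs)
    if n < n_bins:
        raise ValueError("need at least one participant per bin")
    rows = []
    for b in range(n_bins):
        # Integer arithmetic keeps bin sizes within one participant of each other.
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        mean_predicted = sum(p for p, _ in chunk) / len(chunk)
        observed_rate = sum(o for _, o in chunk) / len(chunk)
        rows.append((mean_predicted, observed_rate))
    return rows
```

For a well-calibrated score the returned pairs lie close to the diagonal when plotted against each other.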
Overall, the metabolomic scores demonstrated good calibration.
In the UK Biobank test set the calibration slopes ranged from 0.95 to
1.24 across diseases, as expected since the models were trained in the
other half of this biobank. In the EBB, the calibration slopes ranged
from 0.76 to 1.16, except for depression at 0.42. This difference is likely
a result of diagnostic differences in depression in different countries,
as well as how those diagnoses are encoded in electronic records. In
the Finnish THL Biobank the slopes were 1.03 (ischemic stroke), 1.20
(myocardial infarction) and 1.21 (diabetes), though the absolute rates
varied considerably, likely reflecting different rates of these diseases in
the earlier recruitment waves of these cohorts.
Discussion
We have shown that metabolomic scores can identify individuals at
increased risk across a range of diseases, consistent with a previous
report that analyzed a subset of the data described here10. We have
replicated these findings in two additional national biobanks and
demonstrated consistent performance in three countries which have
varying sample collection types (e.g., plasma vs. serum), enrollment
criteria, fasting protocols, and electronic health record systems. As the
biomarkers that constitute these scores are measured in absolute
units, the scores can be computed without cohort-specific rescaling,
which could aid in clinical translation of the scores. More than 1 in 4
individuals in these biobanks are in the high-risk group for at least one
of the cardiovascular diseases, lung diseases, liver diseases or diabetes,
where the high-risk group has at least a 2.5-fold increase in risk (likely an
underestimate of population levels, due to healthy volunteer bias).
Our direct comparison, and combination, of metabolomic and
polygenic risk factors suggests the value of multi-omic scores. When
good predictors exist from both ‘omics data types (e.g., myocardial
infarction, diabetes, colorectal cancer) the scores are complementary
and together provide an improved combination of predictive accuracy
and specificity. Combining these forms of information may also be
useful in maximizing the predictive accuracy while avoiding the perceived determinism25 of fixed genetic predictions: our data suggest
both that genetic factors contribute stable long-term predictive
information and that these can be outweighed, particularly in the
shorter term, by detectable differences in metabolomic risk profile
(driven, in part, by factors modifiable by lifestyle and treatment
changes). Our analysis of follow-up samples underscores this: the
scores show an intermediate level of stability five years apart, meaning
they provide long-term risk prediction and can also track measurable
change in risk in response to lifestyle changes or treatment. The
observational nature of this dataset, as well as potential survivor bias in
the individuals who participated in the repeat sampling, limit our
ability to make causal inferences about associations between changes
in lifestyle factors and changes in scores. Future studies that explicitly
measure metabolomics before and after an intervention are needed to
fully explore how metabolic risk scores track changes in
modifiable risk.
For 11 of the diseases we studied, we compared to existing scores
used for risk prediction or screening. We identified clear examples
where multi-omic scores alone outperformed these clinical scores,
including for myocardial infarction and liver diseases. In other diseases
the clinical scores perform better, often for understandable reasons.
For instance, for type 2 diabetes the QDiabetes score we used includes
HbA1c, the gold-standard diagnostic biomarker for the disease. For
lung cancer, the clinical score directly includes the vital causal risk
factor of smoking, and yet the multi-omic score still provides significant additional information: if we restrict to current smokers, the
multi-omic score outperforms the clinical score for lung cancer
(hazard ratio of 2.2 vs 1.8). Multi-omic scores provide statistically significant improvements when added on top of existing scores in all but
[Fig. 4 plot panels. Panel A rows (clinical score in parentheses): Myocardial infarction (QRISK3); Ischemic stroke (QRISK3); Intracerebral hemorrhage (QRISK3); Lung cancer (LLP v2 without asbestos); Type 2 diabetes (QDiabetes); Chronic obstructive pulmonary disease (COPD-PS); Vascular and other dementia (QRISK3); Depressive disorders (PHQ-2); Alcoholic liver disease (AUDIT-C); Cirrhosis of the liver (FLI); Colon and rectum cancers (QCancer). Panel A legend: Multi-omics + clinical; Multi-omics; Clinical. Panel A axis: hazard ratio (95% CI), highest risk decile vs. remaining population. Panel B legend: UK Biobank; Estonian Biobank; THL Biobank, pooled. Panel B titles: Chronic obstructive pulmonary disease; Myocardial infarction; Colon and rectum cancers; Type 2 diabetes; Depressive disorders; Ischemic stroke. Panel B axes: predicted event rate (%); observed event rate (%).]
Fig. 4 | Predictive ability and calibration of models including clinical, polygenic
and/or metabolomic scores. A Ten-year hazard ratios of models with clinical
variables compared to those variables plus the best ‘omic data (either just metabolomics or metabolomics plus PGS from Fig. 2, n = 241,750). Dots represent Cox
regression estimates and horizontal error bars denote 95% confidence intervals of
the hazard ratio. B For each disease, calibration of three-year observed event
rates is shown by decile of absolute risk predicted by a metabolomic score adjusted for age and sex (n = 415,592). Dots represent means and
vertical lines represent 95% confidence intervals of the observed event rate. Calibration slopes and intercepts were derived from a logistic regression of the
observed risk on the predicted risk. Source data are provided as a Source Data file.
0.75 (0.72–0.77)
0.75 (0.72–0.77)
0.68 (0.67–0.69)
Alcoholic liver disease (AUDIT-C)
Cirrhosis of the liver (FLI)
Colon and rectum cancers (QCancer)
0.69 (0.68–0.70)
0.01 (p = 1.3e–06)
0.62 (0.60–0.64)
−0.12 (p = 5.4e–19)
0.57 (0.56–0.57)
−0.10 (p = 6.5e–115)
0.66 (0.65–0.66)
−0.21 (p < 5e–324)
0.70 (0.69–0.72)
−0.12 (p = 1.4e–85)
0.70 (0.68–0.72)
−0.01 (p = 5.0e–05)
0.71 (0.70–0.72)
−0.03 (p = 7.9e–29)
0.68 (0.67–0.69)
−0.003 (p = 2.6e–01)
0.80 (0.78–0.82)
0.05 (p = 3.1e–05)
0.87 (0.85–0.88)
0.12 (p = 5.1e–19)
0.61 (0.60–0.62)
−0.06 (p = 1.9e–39)
0.82 (0.81–0.83)
0.000 (p = 0.84)
0.82 (0.81–0.83)
0.76 (0.75–0.77)
−0.05 (p = 1.7e–19)
0.81 (0.80–0.81)
−0.06 (p = 4.3e–102)
0.75 (0.74–0.76)
−0.08 (p = 2.8e–42)
0.71 (0.69–0.72)
−0.008 (p = 0.041)
0.73 (0.72–0.73)
−0.01 (p = 3.7e–08)
0.74 (0.74–0.75)
−0.004 (p = 6.9e–02)
Metabolomics
0.70 (0.69–0.71)
0.02 (p = 1.3e–09)
0.80 (0.78–0.82)
0.05 (p = 2.1e–05)
0.61 (0.61–0.62)
−0.05 (p = 7.5e–33)
0.83 (0.82–0.84)
0.77 (0.76–0.78)
−0.04 (p = 6.7e–12)
0.82 (0.81–0.82)
−0.05 (p = 9.6e–78)
0.75 (0.74–0.77)
−0.07 (p = 2.1e–36)
0.71 (0.69–0.72)
−0.008 (p = 0.034)
0.73 (0.72–0.74)
−0.01 (p = 7.6e–05)
0.76 (0.76–0.77)
0.01 (p = 3.6e–08)
PGS + Metabolomics
Clinical + PGS
0.70 (0.69–0.71)
0.02 (p = 6.3e–20)
0.75 (0.73–0.77)
0.004 (p = 0.11)
0.67 (0.67–0.68)
0.006 (p = 3.1e–05)
0.81 (0.80–0.82)
0.006 (p = 9.9e–07)
0.87 (0.87–0.88)
0.003 (p = 2.7e–17)
0.83 (0.82–0.84)
0.004 (p = 2.7e–04)
0.71 (0.70–0.73)
0.000 (p = 0.36)
0.74 (0.73–0.75)
0.003 (p = 5.9e–05)
0.77 (0.76–0.77)
0.02 (p = 2.0e–27)
0.68 (0.67–0.69)
0.004 (p = 8.7e–03)
0.84 (0.82–0.86)
0.10 (p = 2.5e–23)
0.86 (0.84–0.88)
0.12 (p = 2.4e–24)
0.68 (0.68–0.69)
0.02 (p = 5.2e–12)
0.82 (0.81–0.83)
0.006 (p = 8.5e–04)
0.83 (0.82–0.84)
0.02 (p = 2.0e–21)
0.89 (0.88–0.89)
0.02 (p = 2.4e–69)
0.83 (0.82–0.84)
0.004 (p = 0.017)
0.72 (0.70–0.74) 0.004 (p = 0.11)
0.75 (0.74–0.75)
0.008 (p = 1.8e–07)
0.76 (0.76–0.77)
0.01 (p = 5.9e–26)
Clinical + Metabolomics
0.70 (0.69–0.71)
0.03 (p = 2.0e–19)
0.84 (0.82–0.86)
0.10 (p = 1.9e–23)
0.69 (0.68–0.69)
0.02 (p = 9.2e–15)
0.83 (0.82–0.84)
0.03 (p = 2.7e–27)
0.89 (0.88–0.89)
0.02 (p = 2.5e–84)
0.83 (0.82–0.84)
0.006 (p = 3.5e–04)
0.72 (0.70–0.74)
0.004 (p = 0.16)
0.75 (0.74–0.76)
0.01 (p = 9.6e–10)
0.78 (0.77–0.78)
0.03 (p = 2.6e–51)
Clinical + ‘omics
First row shows the area under receiver operating curve (95% Confidence Interval) for the scores. Second row shows AUC difference in comparison to the clinical score (p-value). Two-tailed p-values were calculated using DeLong’s method, not adjusting for
multiple testing.
0.67 (0.66–0.68)
Depressive disorders (PHQ-2)
0.82 (0.81–0.84)
0.80 (0.80–0.81)
COPD (COPD-PS)
–
0.87 (0.86–0.87)
Type 2 diabetes (QDiabetes)
0.82 (0.81–0.83)
0.82 (0.81–0.84)
Lung cancer (LLP v2 without asbestos)
Alzheimer’s disease (NA)
0.72 (0.70–0.73)
Intracerebral hemorrhage (QRISK3)
Vascular and other dementia (QRISK3)
0.70 (0.69–0.72)
−0.10 (p = 1.4e–81)
0.74 (0.73–0.75)
Ischemic stroke (QRISK3)
0.73 (0.72–0.74)
−0.02 (p = 1.8e–11)
0.75 (0.74–0.76)
Myocardial infarction (QRISK3)
PGS
Clinical
Disease
Table 2 | Comparison of area under receiver operating curves in clinical and ‘omics scores in UK Biobank over ten years of follow-up
one disease. Future work will be needed to quantify the extent to which
these statistically significant improvements in prediction, when used
to guide health interventions, could translate into improvements in
population health, as has been recently studied in the specific case of
cardiovascular disease26. Furthermore, it will be important to consider
whether and how multi-omic data fit in the diverse clinical contexts
represented by these scores. QRISK3 and QDiabetes are mainly tools
for identifying high-risk individuals who would benefit from primary
prevention efforts; COPD-PS, PHQ-2, AUDIT-C and Fatty Liver Index (FLI)
are screening tools for prioritizing further investigation in individuals
who may have undiagnosed disease; and QCancer and Liverpool Lung
Project (LLP) scores are used for both primary prevention and to aid
early detection. In addition to accuracy, our results also allow us to test
calibration across populations, which is vital to future applied use.
While imperfect, the calibration of our scores is comparable to widely
used tools like the pooled cohort equations for cardiovascular risk
when compared, for example between the US and Canada27.
Our study has several limitations which we could only partially
mitigate. First, although biobanks are powerful for studying many
diseases simultaneously, they typically have less detailed and less well-curated phenotypes than disease-specific clinical cohorts. For example, we use endpoint definitions based on broad categories of ICD-10
codes. In the case of cardiovascular disease, we have shown using more
fine-grained coding does not substantially alter our conclusions, but
clinically focused collections will help bring additional resolution.
Furthermore, biobanks are known to have healthy volunteer bias,
which can affect comparisons between these results and true rates of
disease in the overall population. Second, we here defined high-risk
groups as the top 10% of each disease’s risk score, but in practice
different cut points may be appropriate for different diseases (e.g., in
liver disease, risk is strongly concentrated in the top 1%), or cut-offs
may not be necessary at all if the goal is to deliver improved prediction
across the full range of baseline risk in the population. Third, these
three biobanks are nearly all of European ancestry, and while we
showed some promising results in the non-European ancestry subset
of UK Biobank, analyses of more diverse cohorts will be essential.
Finally, in comparing against clinical screening there are certain variables not measured in these biobanks that preclude a complete comparison, including of some widely used low-cost tests (e.g., fecal occult
blood samples in colon cancer) and more intensive screening tools
(e.g., CT screening for lung cancer or colonoscopies for colon cancer).
Many healthcare systems desire a more personalized and preventative model of disease management, including longitudinal
monitoring of risk, early detection of disease, and active, patient-centered management of risk factors. For this reason, it will be valuable
to understand how classical risk factors can be supplemented or
replaced with newer predictors including metabolomics and genetics,
and how and why these risk predictions change over time. We believe
our results, and the extremely large dataset that underlies them, are an
important part of this puzzle. Future work will need to focus on how we
may incorporate additional new predictors to stratify disease risk
further (e.g., recent work on proteomics has shown promise28), as well
as how this next generation of prediction algorithms, described here
for research use, can be validated, made actionable and delivered to
patients.
Methods
Study populations and endpoint definitions
We used data from a total of 700,217 individuals from three biobanks,
UK Biobank (N = 477,078), EBB (N = 190,785) and Finnish THL Biobank
(N = 32,354). Figure S1 shows an overview of our study design and
Table 1 shows summaries of participant characteristics.
The UK Biobank is a longitudinal biomedical study of approximately half a million participants between 38 and 71 years old from the
United Kingdom29. Participant recruitment was conducted on a
volunteer basis and took place between 2006 and 2010. Initial data
were collected in 22 different assessment centers throughout Scotland, England, and Wales. Data collection includes elaborate genotype,
environmental and lifestyle data. Blood samples were drawn at baseline for all participants, with an average of four hours since the last
meal, i.e., generally non-fasting. Nuclear magnetic resonance (NMR)
metabolomic biomarkers (Nightingale Health, quantification library
2020) were measured from EDTA plasma samples (100 μL, aliquot 3)
during 2019–2023 for the entire cohort. In addition, plasma samples
were measured by NMR metabolomics from ~20,000 participants who
underwent a repeat-visit assessment on average five years after the
baseline visit. The NMR protocol is known to perform well on blind
duplicate samples in the UK Biobank8. Follow-up data include a wide
range of electronic health-related records, including disease incidence,
hospital admissions, primary care, and death records, which are presently still regularly updated. The UK Biobank study was approved by
the North West Multi-Centre Research Ethics Committee. This research
was conducted using the UK Biobank Resource under Application
Number 30418.
The EBB is a curated population-based biobank of Estonia, comprising a cohort of ~210,000 volunteers30. The participants constitute
about 20% of the adult Estonian population and the cohort is
approximately representative of the nation in terms of age, sex, and
geographic dispersion. The enrollment was conducted between 2002
and 2022. A network comprising general practitioners and various
medical personnel from private practices, hospitals, and recruitment
offices of the Estonian Genome Center was established for participant
recruitment, as well as for collection of samples and health data. After
recruitment, participants were asked to fill out detailed questionnaires, which encompassed personal information, genealogical
data, educational and occupational history, as well as lifestyle habits.
Blood samples were generally collected non-fasting31. NMR metabolomic measurements were conducted on EDTA plasma samples
(100 μL) for all biobank participants. The EBB database undergoes
regular synchronization with several national registries and hospital
databases, along with the national health insurance fund’s database
that houses comprehensive treatment and service bill information.
Disease events are codified in compliance with the ICD-10 standards
and medication usage is categorized as per the Anatomical Therapeutic Chemical classification, both with current follow-up data
available until the end of 2021. The Estonian Committee on Bioethics
and Human Research approved the study. Data was accessed with
research approval number 1.1-12/2770.
The Finnish THL biobank data consist of five population cohorts
(National FINRISK Studies 1997, 2002, 2007, 2012 and Health 2000
Survey) collected in study-specific years between 1997 and 2012 (refs. 32,33).
Each of the five cohorts is an independent random sample of
unique individuals aged 25–98 (25–74 in FINRISK, 30 and over in
Health 2000). Recruitment was by invitation only in multiple urban and rural areas across Finland to be representative of the
nation (participation rate 60–70%). The baseline surveys included a
wide range of health-related questionnaires and biological measures,
including a non-fasting blood sample (median 5 h since last meal) from
~85% of all participants enrolled. NMR metabolomic data were measured from all participants with blood samples available. In contrast to
the UK Biobank and EBB, NMR metabolomics measurements were done on serum
samples (350 μL)8,32. Information on disease outcomes was linked
from national hospital discharge registries and reimbursement records
with follow-up until 2017 (4 to 19 years of follow-up). The THL biobank
cohorts were approved by the Coordinating Ethical Committee of the
Helsinki and Uusimaa Hospital District, Finland. Data was accessed
with research application number BB2016_86.
In our main analysis, we included 12 diseases that are the top
causes of disability-adjusted life years (DALYs) in the European region in
2019 according to the WHO, except falls and back pain34: myocardial
infarction, ischemic stroke, intracerebral hemorrhage, lung cancer,
type 2 diabetes, COPD, Alzheimer’s disease, vascular and other
dementias, depressive disorders, alcoholic liver disease, cirrhosis of
the liver, and colon and rectum cancers. Some WHO groupings were
coarse (e.g., all dementias together), so we used narrower definitions in
those cases to create more biologically homogenous outcomes
aligned with common disease definitions used for PGS development.
Instead of using ischemic heart disease and diabetes mellitus, we
narrowed these to myocardial infarction and type 2 diabetes; stroke was split
into ischemic stroke and intracerebral hemorrhage; Alzheimer’s disease was separated from vascular and other dementias; and alcoholic
liver disease was separated from cirrhosis of the liver (Supplementary
Data 2). We also included a model for the coarser ischemic heart disease (ICD-10 codes I20-25) endpoint, which we compared to the
myocardial infarction model in a supplementary analysis.
Disease incidence was defined based on the first occurrence of
ICD-10 codes listed in Supplementary Data 2. In the UK Biobank, we
based disease incidence on primary care data (only available in ~45% of
UK Biobank participants), hospital inpatient data, death register
records, cancer registry data, and self-reports at baseline. In the EBB,
we based disease incidence on self-reports at baseline, E-Health, North
Estonia Medical Center, Tartu University Hospital, death registry
records and cancer registry data. The Estonian Health Insurance Fund
was excluded as a source because it appeared that diagnoses exclusively from that source were less severe. In the Finnish THL Biobank,
we based disease incidence on the nation-wide hospital discharge
registry (HILMO), cause-of-death registry, and for certain diseases,
medication reimbursement registry. We considered disease cases
occurring after the blood draw at study baseline as incidence cases.
Cases that occurred prior to the baseline blood draw were considered
prevalent cases and were excluded from the analysis in a disease-specific manner.
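As an illustration of the incident/prevalent rule described above, a minimal Python sketch follows. It is not the study code: the function name and labels are hypothetical, and treating a diagnosis dated on the baseline day itself as prevalent is an assumption, since the text only specifies "after" and "prior to" the blood draw.

```python
from datetime import date
from typing import Optional


def classify_case(baseline: date, first_diagnosis: Optional[date]) -> str:
    """Label one participant for one disease relative to the baseline blood draw."""
    if first_diagnosis is None:
        return "non-case"      # no qualifying ICD-10 code recorded
    if first_diagnosis > baseline:
        return "incident"      # first occurrence after the blood draw
    # Assumption: a diagnosis on the baseline day counts as prevalent.
    return "prevalent"         # excluded from the analysis of this disease only


print(classify_case(date(2008, 5, 1), date(2011, 3, 9)))
```

Because the exclusion is disease-specific, the same participant can be prevalent for one endpoint and still contribute follow-up time to the others.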
Analyses in this paper were all carried out conditional on age and
sex. Sex was defined differently in the different cohorts: in UK Biobank,
sex at recruitment is taken from the patient’s medical record but can
subsequently be edited by the participant if they choose. For the Finnish THL biobank, sex was taken from the population registries held by
the Digital and Population Data Services Agency. In the EBB, sex was
extracted from the Estonian National Identity number which is created
based on the sex recorded in the Estonian Birth Registry.
Written informed consent was obtained from all participants.
Participants were not offered compensation for participating in
this study.
Metabolomic biomarker profiling
Lipid and metabolite biomarkers were quantified from 757,927 blood
samples by high-throughput NMR metabolomics (Nightingale Health
Plc). This number includes 23,080 blinded duplicate samples from the
UK Biobank for quality control purposes. Additionally, each 96-well plate contained two internal control samples. We used the sample
handling and measurement protocol established and validated for the
first phase of metabolomics in UK Biobank8. The measurement protocol was similar for EBB and Finnish THL Biobank32. Briefly, EDTA
samples of at least 90 μL were plated onto 96-well plates at UK Biobank
laboratory (Stockport, UK) and shipped on dry ice in batches of
5000–20,000 samples to Nightingale Health Laboratories in Finland.
Samples were provided to the Nightingale Health lab using randomly
assigned pseudonymous sample IDs and were analysed in three tranches blinded to the clinical phenotype of the sample, with UK Biobank
only revealing the linkage from pseudonymous sample IDs to true
sample IDs after data generated for that tranche was complete. Samples were thawed overnight at +4 °C, mixed and centrifuged, transferred to NMR tubes and mixed in 1:1 ratio with a phosphate buffer
(75 mM Na2HPO4 in 80%/20% H2O/D2O, pH 7.4, including also 0.08%
sodium 3-(trimethylsilyl) propionate-2,2,3,3-d4 and 0.04% sodium
azide). The samples were profiled using a total of nine 500 MHz
spectrometers (Bruker AVANCE IIIHD). Two NMR spectra are recorded
(a presaturated proton spectrum and a Carr–Purcell–Meiboom–Gill
T2-relaxation-filtered spectrum) and proprietary software is used for
biomarker quantification in absolute units (Nightingale Health, quantification library 2020)7,8. This provides 249 biomarker measures in a
single assay (168 absolute and 81 ratio measures), including routine
lipids, lipoprotein profiling of 14 size subclasses, fatty acids, and various low-molecular weight metabolites, such as amino acids, ketones,
and glycolysis metabolites as well as two inflammatory protein measures, albumin, and glycoprotein acetyls. For risk model training we
used the 36 clinically validated biomarkers with CE mark in the NMR
metabolomics assay, to facilitate rapid translation into clinical applications (Total cholesterol, VLDL cholesterol, Clinical LDL cholesterol, HDL cholesterol, Total triglycerides,
Apolipoprotein B, Apolipoprotein A1, Ratio of apolipoprotein B to
apolipoprotein A1, Total fatty acids, Omega-3 fatty acids, Omega-6
fatty acids, Polyunsaturated fatty acids, Monounsaturated fatty acids,
Saturated fatty acids, Docosahexaenoic acid, Ratio of omega-3 fatty
acids to total fatty acids, Ratio of omega-6 fatty acids to total fatty
acids, Ratio of polyunsaturated fatty acids to total fatty acids, Ratio of
monounsaturated fatty acids to total fatty acids, Ratio of saturated
fatty acids to total fatty acids, Ratio of docosahexaenoic acid to total
fatty acids, Ratio of polyunsaturated fatty acids to monounsaturated
fatty acids, Ratio of omega-6 fatty acids to omega-3 fatty acids, Alanine,
Glycine, Histidine, Total concentration of branched-chain amino acids,
Isoleucine, Leucine, Valine, Phenylalanine, Tyrosine, Glucose, Creatinine, Albumin, Glycoprotein acetyls)8.
To account for potential glucose degradation prior to plasma
sample preparation, we used an estimate of physiological glucose
concentration based on observed glucose and lactate as input for
the risk models. Further, to correct for spectrometer differences in
alanine concentration, the single metabolite most impacted by technical variation35, we rescaled the alanine concentrations observed
within each spectrometer in each biobank to match the mean and standard
deviation of a master spectrometer. The average biomarker detection
rate was >99% across the plasma samples. Further details on the individual biomarker measures are provided in the UK Biobank data
resource.
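One way to read the alanine correction is as a per-spectrometer standardization followed by rescaling to the master spectrometer. The Python sketch below shows that reading only; the function name is hypothetical, the use of the sample standard deviation is an assumption, and the numbers are made up.

```python
from statistics import mean, stdev


def align_to_master(values, master_values):
    """Rescale one spectrometer's values to the master's mean and SD."""
    m, s = mean(values), stdev(values)
    master_m, master_s = mean(master_values), stdev(master_values)
    # Standardize within the spectrometer, then map onto the master scale.
    return [master_m + (v - m) / s * master_s for v in values]


aligned = align_to_master([1.0, 2.0, 3.0], [10.0, 12.0, 14.0])
print(aligned)
```

After this step the aligned values share the master spectrometer's mean and standard deviation, while the rank order within the spectrometer is unchanged.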
Genotype data and polygenic scores
For this study, genotype data was available for UK Biobank and EBB,
but not for THL Biobank. The UK Biobank participants have been genotyped on Applied Biosystems UK Biobank Axiom Array and UK BiLEVE
Axiom Array, measuring over 800,000 variants and imputed using the
Haplotype Reference Consortium and UK10K and 1000 Genomes
reference panels outside of this study29. EBB participants have been
genotyped with genome-wide chip arrays and further imputed with a
population-specific imputation panel consisting of high-coverage (30-fold) whole-genome sequence data from 2244 individuals and over 16
million high-quality genetic variants36.
For 10 of the 12 diseases, we used an existing PGS from the PGS
Catalog37 that was developed using GWAS summary statistics that did
not include the UK Biobank in their discovery cohort (Supplementary
Data 2). These PGS were computed for UK Biobank participants as the
weighted sum of risk alleles using imputed genotype data. We were
also able to compute six of these PGS in EBB, though we note that a
small number (~8000) of EBB samples were included in the GWAS
underlying the PGS of diabetes and myocardial infarction. For UK
Biobank, we estimated participants’ genetic ancestry with respect to
the five superpopulations of the 1000 Genomes Project38 using principal component analysis projection and a random forest classifier,
and scaled PGS with respect to their estimated ancestry. For EBB we
assumed more homogenous genetic background and scaled PGS
within the cohort.
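The two operations described here, a weighted sum of risk-allele dosages and scaling within an ancestry group, can be sketched as follows. This is an illustrative Python sketch under stated assumptions: variant identifiers, weights and group labels are invented, a variant missing from a participant's genotype data is treated as dosage 0, and scaling is taken to mean standardization to zero mean and unit SD within each group.

```python
from statistics import mean, stdev


def polygenic_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0 to 2, possibly fractional if imputed)."""
    # Assumption: a variant absent from `dosages` contributes nothing.
    return sum(w * dosages.get(variant, 0.0) for variant, w in weights.items())


def standardize_within_groups(scores, groups):
    """Scale scores to mean 0 and SD 1 separately within each group label."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    stats = {g: (mean(v), stdev(v)) for g, v in by_group.items()}
    return [(s - stats[g][0]) / stats[g][1] for s, g in zip(scores, groups)]


weights = {"var_a": 0.1, "var_b": 0.4}            # hypothetical effect weights
print(polygenic_score({"var_a": 2, "var_b": 0.5}, weights))
print(standardize_within_groups([1.0, 3.0, 10.0, 30.0], ["A", "A", "B", "B"]))
```

For a single homogeneous cohort, as assumed for EBB, the same function can be called with one shared group label.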
Clinical scores
A disease-specific clinical risk score was chosen for each endpoint to
act as a comparator for benchmarking our risk scores. The risk scores
were chosen such that they could be fully or partially calculated in UK
Biobank participants using data available at baseline, and were taken
from national risk assessment, screening and diagnosis guidance from
NHS England and Wales where available, and from recommendations
from other government or professional bodies where not available.
Two risk scores were modified to account for data not available in UK
Biobank at baseline (AUDIT-C and COPD-PS), and one was modified to
remove a question that showed informative missingness in our dataset
(LLPv2). We did not identify a widely used clinical risk score for Alzheimer’s disease, so we did not carry out benchmarking for this endpoint.
Justifications for the choice of risk scores and details of their implementation are given in the Supplementary Methods document, and
details, including code lists, of how each variable is derived from UK
Biobank data are given in Supplementary Data 10. We defined high-risk
individuals on the basis of the clinical scores as those who were in the
top decile of risk, after adjusting for age and sex, in the training set. For
some clinical scores (AUDIT-C, COPD-PS and PHQ2) it was not possible
to define a top decile of risk, as these scores take discrete values and
therefore have exact ties that result in participants being on the border
of high and low risk; in these cases we randomly assigned individuals
with a borderline risk to the high-risk or low-risk category with fixed per-score probabilities, chosen to ensure that 10%
of individuals were assigned as high risk in the training set. During
model fitting variables were transformed to ensure a close to normal
distribution (log transformation for FLI, logit transformation for
QRISK3, QDiabetes, QCancer and LLPv2). We additionally assessed the
performance of the two lung and two liver models in the context of
smoking and drinking behavior. Lung cancer and COPD HR for the top
decile versus the bottom 90% were evaluated separately for ever and
never smokers, and adjusted for pack years of smoking. Liver cirrhosis
and alcoholic liver disease HR were evaluated separately for ever and
never drinkers, and adjusted for daily units of alcohol consumption.
We also computed correlations between the disease scores, pack years
and alcohol units.
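The random assignment used for discrete scores with ties at the decile border can be made concrete with a short calculation. If a fraction a of the training set scores strictly above the border value and a fraction b sits exactly on it, assigning border individuals to the high-risk group with probability p = (0.10 - a) / b gives an expected high-risk fraction of 10%. The Python sketch below illustrates this; function names and data are hypothetical, and the age and sex adjustment described in the text is omitted for brevity.

```python
import random
from collections import Counter


def borderline_rule(train_scores, target=0.10):
    """Return (border_value, p): scores above border_value are high risk,
    scores equal to it are high risk with probability p."""
    n = len(train_scores)
    counts = Counter(train_scores)
    above = 0  # number of individuals strictly above the current value
    for value in sorted(counts, reverse=True):
        at = counts[value]
        if (above + at) / n >= target:
            return value, (target * n - above) / at
        above += at
    raise ValueError("empty training set")


def assign_high_risk(score, border_value, p, rng):
    """Deterministic away from the border, random exactly on it."""
    if score > border_value:
        return True
    if score == border_value:
        return rng.random() < p
    return False


scores = [3] * 4 + [2] * 20 + [0] * 76     # made-up discrete score
border, p = borderline_rule(scores)
rng = random.Random(0)
flags = [assign_high_risk(s, border, p, rng) for s in scores]
print(border, p, sum(flags))
```

In this made-up example 4% of individuals are above the border and 20% are on it, so each border individual is assigned to the high-risk group with probability 0.3.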
Statistical analyses
Prior to epidemiological data analyses, we set any value of the metabolomic biomarkers or PGS to missing if the value was more than 4 standard deviations away from the mean. Samples with any missing
information (metabolomic biomarkers, PGS or clinical variables)
required for score training or prediction were excluded separately for
each analysis. Considering the 36 CE-marked metabolomic biomarkers, 11,676 samples (2.4%) had at least one biomarker missing, and
16,614 samples (3.5%) were excluded due to outlier filtering. Additional
analyses comparing performance within excluded samples showed
that exclusions had little effect on the results (Fig. S15). We did not
filter on genetic ancestry or ethnicity. Metabolomic biomarker measures were log1p-transformed, and all continuous variables were
Z-normalized to have a mean of zero and a standard deviation of one in
the training set. The means and standard deviations of the training set
were subsequently used to scale the metabolomic biomarkers in the
testing and replication cohorts.
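The preprocessing steps in this paragraph can be sketched as a small pipeline. The following Python sketch is illustrative only: function names are hypothetical, missing values are represented as None, outlier filtering is assumed to act on the untransformed values, and the sample standard deviation is used throughout.

```python
import math
from statistics import mean, stdev


def mask_outliers(values, n_sd=4.0):
    """Set values more than n_sd standard deviations from the mean to missing."""
    present = [v for v in values if v is not None]
    m, s = mean(present), stdev(present)
    return [None if v is None or abs(v - m) > n_sd * s else v for v in values]


def fit_scaler(train_values):
    """Mean and SD of the log1p-transformed training values."""
    logged = [math.log1p(v) for v in train_values if v is not None]
    return mean(logged), stdev(logged)


def transform(values, scaler):
    """Apply log1p, then scale with the training-set mean and SD."""
    m, s = scaler
    return [None if v is None else (math.log1p(v) - m) / s for v in values]


train = mask_outliers([1.0, 2.0, 3.0, 4.0])
scaler = fit_scaler(train)
print(transform(train, scaler))        # training set: mean 0, SD 1
print(transform([2.5, None], scaler))  # other cohort, scaled with training statistics
```

Reusing the training-set statistics for the testing and replication cohorts keeps a given biomarker value mapped to the same scaled value in every cohort.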
We assigned all participants with a repeat measurement to the
testing set (to increase sample size for that analysis), and then split the
remaining UK Biobank data in half for training and testing, resulting in
a total of 241,246 individuals to train the risk scores. We used 10 years
of follow-up and Cox proportional hazards regression modeling with
least absolute shrinkage and selection operator (Lasso) and tenfold
cross-validation using the R package hdnom. We favored the parsimonious Lasso models for interpretability after observing no consistent pattern of better-performing prediction between Lasso and
Elastic Net (Fig. S16). For each disease, we trained nine models, which
all contained (1) age and sex, and additionally included (2) metabolomic measures, (3) a disease-specific PGS, (4) metabolomic measures,
and a disease-specific PGS, (5) a disease-specific clinical score, (6) a
disease-specific clinical score and PGS, (7) a disease-specific clinical
score and metabolomic measures, (8) a disease-specific clinical score,
PGS and metabolomic measures, and (9) an extended set of metabolomic biomarkers. Age and sex were not penalized to ensure they were
selected for each model and appropriately weighted. The extended set
of metabolomic biomarkers contained all 249 available absolute and
ratio measures. For the genetic ancestry analyses, we trained another
set of models identical to the description above but leaving all participants with non-European genetic ancestry out from the training set
to maximize the size of this group in the testing set.
We computed risk scores (for research use only) as the weighted
sum of variables selected during training for each model. The variables
that were selected for each model during training and their respective
coefficients can be found in Supplementary Data 3. In computing the
risk scores, we excluded the sex and age coefficients, which for a
combined metabolomic and PGS model for example, effectively
results in risk scores comprised of the weighted sum of 1–36 metabolomic measures and a PGS, which have been adjusted for age and
sex. We evaluated correlations of the scores by computing Pearson
correlation coefficients between different endpoints by model type,
between duplicate measurements and between baseline and repeat
assessments.
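The score computation described above amounts to a weighted sum that skips the age and sex terms. A minimal Python sketch, with hypothetical variable names and made-up coefficients (the actual selected variables and coefficients are those reported in Supplementary Data 3):

```python
def risk_score(features, coefficients, excluded=("age", "sex")):
    """Weighted sum of the selected variables, leaving out age and sex."""
    return sum(
        coef * features[name]
        for name, coef in coefficients.items()
        if name not in excluded and coef != 0.0   # Lasso sets unselected terms to 0
    )


# Hypothetical fitted coefficients for a combined metabolomic and PGS model.
coefficients = {"age": 0.8, "sex": 0.3, "glucose": 0.4, "albumin": -0.2, "pgs": 0.25}
participant = {"age": 1.2, "sex": 1.0, "glucose": 0.5, "albumin": -1.0, "pgs": 2.0}
print(risk_score(participant, coefficients))
```

Because age and sex were in the model when the other coefficients were estimated, the remaining weighted sum reflects risk beyond what age and sex explain.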
We tested risk model performance in all three biobanks by
assessing the association of these age- and sex-adjusted scores with
disease incidence within the first 4 years (as the EBB has a large number
of samples limited to this amount of follow-up) after the blood draw
using Cox proportional hazards models and examining the hazard
ratios (HR) with 95% confidence intervals between individuals in the
top decile of the score and individuals at the bottom 90%. Fixed effect
meta-analysis of estimated hazard ratios across the three biobanks was
carried out with the R package meta. We calculated the limits for the
top deciles for each model in the training set once and subsequently
used these to determine top decile classification in validation and
replication cohorts (Supplementary Data 4). We assessed HR for 4 and
10 years of follow-up in the UK Biobank for PGS (also including results
from the EBB) and disease-specific clinical risk score comparisons. We
additionally stratified HR by the source of report of disease incidence
(Fig. S17) and whether the individual had primary care data available
(Fig. S18). In addition, we computed HR per one standard deviation
increments in the age- and sex-adjusted scores using Cox proportional
hazards models. We compared these results to another version in
which biomarker scaling and decile limits were not determined by the
training set, but calculated within each separate cohort (Fig. S4). For
statistical significance, we considered p-values < 0.004 corresponding
to a 95% confidence level Bonferroni corrected for 12 diseases39,40. We
estimated Kaplan–Meier curves stratified by PGS and metabolomic
score deciles for 10 years of follow-up using R package survival41 for
individuals in the top decile of the PGS score and the bottom 90% of
the metabolomic score, the top decile of both the PGS and metabolomic scores, and the bottom 90% of the PGS score. We used the
cox.zph function to examine the proportionality of hazards assumption and Schoenfeld residual plots, which revealed that the HR was not
constant over the 10-year follow-up time for some disease endpoints.
Therefore, we assessed the HR both continuously and in strata
across the follow-up time for the metabolomics and PGS models. Using
the R package rstpm2 (ref. 42) we built a generalized survival model with
natural splines using 2 knots to allow for a time-varying effect, and
additionally computed the hazard ratio for 1-year strata using the
survSplit option in the survival R package41. We also tested for interaction effects between the metabolomic scores and PGS. Area under
receiver operating curves (AUC) were estimated utilizing absolute risks
using the R package pROC43. Net reclassification improvements (NRI)
were computed comparing metabolomic and PGS scores to disease-specific clinical scores. Both continuous and categorical scores were
calculated, with categorical scores using limits of top 10% of the scores
as cut-off thresholds. The R package nricens44 was utilized in NRI
computation.
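Two of the calculations in this paragraph are simple enough to work through directly: the fixed-effect pooling of hazard ratios across biobanks and the Bonferroni threshold. The Python sketch below uses generic inverse-variance weighting on the log-HR scale, with standard errors recovered from 95% confidence intervals. It is not the implementation of the R package meta, and the hazard ratios shown are invented for illustration.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% confidence interval


def fixed_effect_meta(estimates):
    """Pool (HR, lower, upper) triples by inverse-variance weighting of log HRs."""
    total_weight = 0.0
    weighted_sum = 0.0
    for hr, lower, upper in estimates:
        se = (math.log(upper) - math.log(lower)) / (2 * Z95)
        weight = 1.0 / se ** 2
        total_weight += weight
        weighted_sum += weight * math.log(hr)
    pooled = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled),
        math.exp(pooled - Z95 * pooled_se),
        math.exp(pooled + Z95 * pooled_se),
    )


# Made-up hazard ratios for three cohorts.
print(fixed_effect_meta([(2.0, 1.5, 2.67), (2.4, 1.2, 4.8), (1.8, 1.0, 3.24)]))

# Bonferroni threshold for 12 diseases at a 5% family-wise level.
print(round(0.05 / 12, 3))
```

Cohorts with narrower confidence intervals receive larger weights, so the pooled estimate is dominated by the largest biobank.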
For analyses examining two time points, we had 18,709 UKB
participants with metabolomic scores calculated from both baseline
and a repeat visit which took place after 2 to 7 years. We considered the
eight diseases that had at least 100 cases within 10 years after the
repeat visit (349 cases for COPD, 214 for colon cancer, 288 for
depression, 439 for diabetes, 303 for myocardial infarction and 225 for
ischemic stroke). We fitted Cox proportional hazards models for disease events 10 years after the repeat visit with age- and sex-adjusted
metabolomic scores at baseline and repeat visit. Disease events
between baseline and repeat visit were excluded. To assess the risk
changes between the baseline and repeat visit, we categorized participants into three groups: those who stayed in the highest decile of
metabolomic score at both time-points, those who were in the highest
decile of metabolomic score at baseline but left the high-risk category
at the repeat visit, and those in the bottom 90% of the metabolomic
score at baseline. We tested for a difference between the two groups
(stayers vs leavers) using Cox regression, including the baseline
metabolomic score as a covariate to control for the fact that stayers
had, on average, a higher baseline score than leavers. Additionally, we
analyzed 5038 EBB participants with two separate blood metabolomics
measurements taken on average five years apart, as above, for
diabetes, since this was the only disease with sufficient events to
assess replication.
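The three-group categorization used for the repeat-visit analysis can be written as a small function. This Python sketch is illustrative: the labels and the use of a single top-decile cutoff for both visits (for example, the training-set limit) are assumptions, as is treating a score exactly at the cutoff as high risk.

```python
def trajectory_group(baseline_score, repeat_score, top_decile_cutoff):
    """Assign a participant to one of the three repeat-visit groups."""
    high_at_baseline = baseline_score >= top_decile_cutoff
    high_at_repeat = repeat_score >= top_decile_cutoff
    if not high_at_baseline:
        return "bottom 90% at baseline"
    return "stayer" if high_at_repeat else "leaver"


cutoff = 1.28  # hypothetical top-decile limit of a standardized score
for baseline, repeat in [(1.9, 1.6), (1.5, 0.4), (0.2, 1.7)]:
    print(baseline, repeat, trajectory_group(baseline, repeat, cutoff))
```

Participants who enter the top decile only at the repeat visit fall in the bottom-90% group, since grouping is anchored on the baseline score.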
We assessed clinical characteristics of high-risk individuals,
defined as those with a metabolomics model score in the highest
decile of at least one of the seven best-performing models: alcoholic
liver disease, COPD, cirrhosis of the liver, ischemic stroke, lung
cancer, myocardial infarction, and type 2 diabetes. In addition to
basic clinical characteristics, we evaluated the frailty index, a measure to quantify aging and health, between high- and low-risk individuals. We calculated the frailty index based on the method
described by Williams et al. 20, using 49 self-reported disease outcomes in UK Biobank participants. Participants with at least 10
missing items were excluded. We tested the difference in mean for
clinical characteristics between high- and low-risk individuals using a
two-sided t-test for continuous variables and a Chi-squared test for
categorical variables.
To evaluate calibration of the metabolomic scores, we estimated
observed and predicted incidence rates in all three biobanks over 3
years of follow-up. We chose to censor at three years to obtain complete and comparable follow-up for as many samples as possible. We
estimated calibration slopes and intercepts by fitting logistic regression of individual disease status (observed risk) on predicted risk23,24.
We performed all statistical analyses and modeling in R version 4.3.2 (ref. 45).
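A simple companion to calibration slopes and intercepts is a table of observed versus predicted incidence within bins of predicted risk. The Python sketch below shows that comparison only; it does not fit the logistic regression described above, it assumes a binary disease status within the follow-up window (so censoring is ignored), and the data are made up.

```python
from statistics import fmean


def calibration_table(predicted, observed, n_bins=10):
    """Return (mean predicted risk, observed incidence) for each risk bin."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if idx:  # skip empty bins when there are fewer samples than bins
            rows.append((
                fmean([predicted[i] for i in idx]),
                fmean([observed[i] for i in idx]),
            ))
    return rows


predicted = [0.1] * 10 + [0.5] * 10
observed = [1] + [0] * 9 + [1] * 5 + [0] * 5
for mean_predicted, incidence in calibration_table(predicted, observed, n_bins=2):
    print(round(mean_predicted, 3), round(incidence, 3))
```

For a well-calibrated score the two columns agree in every bin, which corresponds to a calibration slope near one and an intercept near zero.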
Reporting summary
Further information on research design is available in the Nature
Portfolio Reporting Summary linked to this article.
Data availability
For reasons of patient confidentiality and to ensure research is carried
out in accordance with the terms of consent the data were collected
under, data from the three biobanks are available under controlled
access. The UK Biobank data are available for approved researchers
through the UK Biobank data-access protocol
(https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access).
The data from the first ~280,000 UK Biobank participants are included
in the Data Showcase
(https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220), and the full
dataset is available on the Research Analysis Platform
(https://www.ukbiobank.ac.uk/enable-your-research/research-analysis-platform).
The average number of weeks from application submission
to data release is 15 weeks for UK Biobank. Data from Estonian Biobank
can be accessed through a research application to Institute of Genomics
of the University of Tartu
(https://genomics.ut.ee/en/content/estonian-biobank). Data from FINRISK
and Health 2000 cohorts can be accessed through a research application
to THL Biobank (https://thl.fi/en/web/thl-biobank). Source data for
Figs. 1–4 are provided with this paper.
Code availability
Code to reproduce the figures in this paper is available at:
https://github.com/NightingaleHealth/ukb-nightingale-omics-prediction/.
References
1. Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and
validation of QRISK3 risk prediction algorithms to estimate future
risk of cardiovascular disease: prospective cohort study. BMJ 357,
j2099 (2017).
2. Lambert, S. A., Abraham, G. & Inouye, M. Towards clinical utility of
polygenic risk scores. Hum. Mol. Genet. 28, R133–R142 (2019).
3. Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools
to clinical instruments. Genome Med. 12, 44 (2020).
4. Khera, A. V. et al. Genome-wide polygenic scores for common
diseases identify individuals with risk equivalent to monogenic
mutations. Nat. Genet. 50, 1219–1224 (2018).
5. Mars, N. et al. Polygenic and clinical risk scores and their impact on
age at onset and prediction of cardiometabolic diseases and
common cancers. Nat. Med. 26, 549–557 (2020).
6. Inouye, M. et al. Genomic risk prediction of coronary artery disease
in 480,000 adults. J. Am. Coll. Cardiol. 72, 1883–1893 (2018).
7. Soininen, P., Kangas, A. J., Würtz, P., Suna, T. & Ala-Korpela, M.
Quantitative serum nuclear magnetic resonance metabolomics in
cardiovascular epidemiology and genetics. Circulation: Cardiovasc.
Genet. 8, 192–206 (2015).
8. Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and
disease in 118,461 individuals from the UK Biobank. Nat. Commun.
14, 604 (2023).
9. Würtz, P. et al. Quantitative serum nuclear magnetic resonance
metabolomics in large-scale epidemiology: a primer on -omic
technologies. Am. J. Epidemiol. 186, 1084–1096 (2017).
10. Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).
11. Pietzner, M. et al. Plasma metabolites to profile pathways in noncommunicable disease multimorbidity. Nat. Med. 27, 471–479 (2021).
12. Morze, J. et al. Metabolomics and Type 2 Diabetes Risk: An Updated
Systematic Review and Meta-analysis of Prospective Cohort Studies. Diab. Care 45, 1013–1024 (2022).
13. Deelen, J. et al. A metabolic profile of all-cause mortality risk
identified in an observational study of 44,168 individuals. Nat.
Commun. 10, 3346 (2019).
14. Lauber, C. et al. Lipidomic risk scores are independent of polygenic
risk scores and can predict incidence of diabetes and cardiovascular disease in a large population cohort. PLoS Biol. 20,
e3001561 (2022).
15. Walford, G. A. et al. Metabolite traits and genetic risk provide
complementary information for the prediction of future type 2
diabetes. Diab. Care 37, 2508–2514 (2014).
16. Unterhuber, M. et al. Proteomics-enabled deep learning machine
algorithms can enhance prediction of mortality. J. Am. Coll. Cardiol.
78, 1621–1631 (2021).
17. Godbole, S. et al. A metabolomic severity score for airflow
obstruction and emphysema. Metabolites 12, 368 (2022).
18. Riveros-Mckay, F. et al. Integrated polygenic tool substantially
enhances coronary artery disease prediction. Circ. Genom. Precis.
Med. 14, e003304 (2021).
19. Thompson, D. J. et al. A systematic evaluation of the performance
and properties of the UK Biobank Polygenic Risk Score (PRS)
Release. PLoS One 19, e0307270 (2024).
20. Williams, D. M., Jylhävä, J., Pedersen, N. L. & Hägg, S. A frailty index
for UK biobank participants. J. Gerontol. Ser. A 74, 582–587 (2019).
21. Mak, J. K. L. et al. Unraveling the metabolic underpinnings of frailty
using multicohort observational and Mendelian randomization analyses. Aging Cell e13868 https://doi.org/10.1111/acel.13868 (2023).
22. Alba, A. C. et al. Discrimination and calibration of clinical prediction
models: users’ guides to the medical literature. JAMA 318,
1377–1384 (2017).
23. Calster, B. V. et al. A calibration hierarchy for risk models was
defined: from utopia to empirical data. J. Clin. Epidemiol. 74,
167–176 (2016).
24. Collins, G. S., Ogundimu, E. O. & Altman, D. G. Sample size considerations for the external validation of a multivariable prognostic
model: a resampling study. Stat. Med. 35, 214–226 (2016).
25. Kullo, I. J. et al. Polygenic scores in biomedical research. Nat. Rev.
Genet. 23, 524–532 (2022).
26. Ritchie, S. C. et al. Cardiovascular risk prediction using metabolomic biomarkers and polygenic risk scores: A cohort study and
modelling analyses. Preprint at https://www.medrxiv.org/content/
10.1101/2023.10.31.23297859v1 (2023).
27. Ko, D. T. et al. Calibration and discrimination of the Framingham
risk score and the pooled cohort equations. CMAJ 192, E442–E449
(2020).
28. Gadd, D. A. et al. Blood protein assessment of leading incident
diseases and mortality in the UK Biobank. Nat Aging 4, 939–948
(2024).
29. Bycroft, C. et al. The UK Biobank resource with deep phenotyping
and genomic data. Nature 562, 203–209 (2018).
30. Leitsalu, L. et al. Cohort profile: Estonian Biobank of the Estonian
Genome Center, University of Tartu. Int. J. Epidemiol. 44,
1137–1147 (2015).
31. Fischer, K. et al. Biomarker profiling by nuclear magnetic
resonance spectroscopy for the prediction of all-cause mortality: an observational study of 17,345 persons. PLoS Med. 11,
e1001606 (2014).
32. Tikkanen, E. et al. Metabolic biomarker discovery for risk of peripheral artery disease compared with coronary artery disease:
lipoprotein and metabolite profiling of 31 657 individuals from 5
prospective cohorts. J. Am. Heart Assoc. 10, e021995 (2021).
33. Borodulin, K. et al. Cohort profile: the National FINRISK Study. Int. J.
Epidemiol. 47, 696–696i (2018).
34. World Health Organization. Global health estimates: Leading causes of DALYs. https://www.who.int/data/gho/data/themes/
mortality-and-global-health-estimates/global-health-estimates-leading-causes-of-dalys.
35. Ritchie, S. C. et al. Quality control and removal of technical variation
of NMR metabolic biomarker data in ~120,000 UK Biobank participants. Sci. Data 10, 64 (2023).
36. Mitt, M. et al. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel. Eur. J. Hum. Genet. 25,
869–876 (2017).
37. Lambert, S. A. et al. The Polygenic Score Catalog: new functionality
and tools to enable FAIR research. Preprint at https://www.medrxiv.
org/content/10.1101/2024.05.29.24307783v1 (2024).
38. Auton, A. et al. A global reference for human genetic variation.
Nature 526, 68–74 (2015).
39. Neyman, J. & Pearson, E. S. On the use and interpretation of certain
test criteria for purposes of statistical inference: part I. Biometrika
20A, 175–240 (1928).
40. Dunn, O. J. Multiple comparisons among means. J. Am. Stat. Assoc.
56, 52–64 (1961).
41. Therneau T. A Package for Survival Analysis in R. R package version
3.7-0, https://CRAN.R-project.org/package=survival (2024).
42. Liu, X.-R., Pawitan, Y. & Clements, M. Parametric and penalized
generalized survival models. Stat. Methods Med. Res. 27,
1531–1546 (2018).
43. Robin, X. et al. pROC: an open-source package for R and S+ to
analyze and compare ROC curves. BMC Bioinforma. 12, 77
(2011).
44. Inoue, E. nricens: NRI for risk prediction models with time to event
and binary response data. https://doi.org/10.32614/CRAN.package.nricens
45. R Core Team. R: The R project for statistical computing. R foundation for statistical computing, https://www.R-project.org/ (Vienna,
Austria, 2022).
Acknowledgements
We acknowledge the lab and spectrometry teams at Nightingale Health
for their role in generating the metabolomic data on these cohorts. We
are grateful to UK Biobank (Project 30418), Estonian Biobank, and THL
Biobank (project BB2016_86) for access to data to undertake this study.
We thank all biobank participants for their generous contribution to
generating this resource for the scientific community. We acknowledge
the Estonian Biobank Research Team of Mari Nelis, Georgi Hudjasov,
Reedik Mägi, Andres Metspalu and Lili Milani. Additionally, we thank
Kristi Läll, Erik Abner and Kelli Lehto for their help with Estonian Biobank
phenotype data. The work was funded by Nightingale Health Plc. Estonian Biobank was supported by Estonian Research Council grant
PRG1291. TT was supported by Estonian Research Council grant
PSG809.
Author contributions
Conceptualization: J.C.B., T.E., H.J., P.W. Data Curation: H.J., N.K., S.K.,
S.N.L., K.S. Formal Analysis: L.J.-D., H.J., N.K., S.K., S.N.L., K.S. Methodology: H.J. Investigation: J.C.B., L.J.-D., H.J., K.H., N.K., S.K., A.K., S.N.L.,
V.M., K.N., K.S., M.S., P.S., M.T., P.W. Resources: T.E., P.J., T.J., J.K., A.L.,
M.P., V.S., T.T. Supervision: J.C.B., H.J., P.W. Visualization: L.J.-D., H.J.,
S.K., S.N.L., K.S. Writing – original draft: J.C.B. Writing – review & editing:
J.C.B., L.J.-D., H.J., S.K., S.N.L., N.K., K.S., P.W.
Competing interests
JCB, LJD, HJ, AK, NK, SK, HK, SL, VK, KN, KS, MS, PS, MT, and PW are
employees of, and hold shares or stock options in, Nightingale Health.
The remaining authors declare no conflict of interest.
Additional information
Supplementary information The online version contains
supplementary material available at
https://doi.org/10.1038/s41467-024-54357-0.
Correspondence and requests for materials should be addressed to
Jeffrey C. Barrett.
Peer review information Nature Communications thanks Themistocles
Assimes, and the other, anonymous, reviewer(s) for their contribution to
the peer review of this work. A peer review file is available.
Reprints and permissions information is available at
http://www.nature.com/reprints
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International License,
which permits any non-commercial use, sharing, distribution and
reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if you modified the licensed
material. You do not have permission under this licence to share adapted
material derived from this article or parts of it. The images or other third
party material in this article are included in the article’s Creative
Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons
licence and your intended use is not permitted by statutory regulation or
exceeds the permitted use, you will need to obtain permission directly
from the copyright holder. To view a copy of this licence, visit http://
creativecommons.org/licenses/by-nc-nd/4.0/.
© The Author(s) 2024
Nightingale Health Biobank Collaborative Group
Jeffrey C. Barrett 1 , Tõnu Esko2, Krista Fischer 2,3, Luke Jostins-Dean1, Pekka Jousilahti 4, Heli Julkunen 1,
Tuija Jääskeläinen4, Antti Kangas1, Nurlan Kerimov 1, Sini Kerminen 1, Anastassia Kolde 2,3, Harri Koskela1,
Jaanika Kronberg 2, Sara N. Lundgren1, Annamari Lundqvist4, Valtteri Mäkelä1, Kristian Nybo1, Markus Perola4,
Veikko Salomaa 4, Kirsten Schut1, Maiju Soikkeli1, Pasi Soininen1, Mika Tiainen1, Taavi Tillmann5 & Peter Würtz 1
1Nightingale Health, Helsinki, Finland. 2Institute of Genomics, Faculty of Science and Technology, University of Tartu, Tartu, Estonia. 3Institute of Mathematics
and Statistics, Faculty of Science and Technology, University of Tartu, Tartu, Estonia. 4Department of Public Health, Finnish Institute for Health and Welfare,
Helsinki, Finland. 5Institute of Family Medicine and Public Health, University of Tartu, Tartu, Estonia.
e-mail: jeffrey.barrett@nightingalehealth.com
Nature Communications | (2024)15:10092
Publication V
Heli Julkunen, Juho Rousu. Machine learning for comprehensive interaction modelling improves disease risk prediction in the UK Biobank.
Submitted, July 2024. Available on medRxiv.
Machine learning for comprehensive interaction modelling
improves disease risk prediction in the UK Biobank
Heli Julkunen 1,* and Juho Rousu 1
1 Department of Computer Science, Aalto University, Espoo, Finland
* Corresponding author, heli.julkunen@aalto.fi
Abstract
Understanding how risk factors interact to jointly influence disease risk can provide insights
into disease development and improve risk prediction. We introduce survivalFM, a machine
learning extension to the widely used Cox proportional hazards model that incorporates estimation of all potential pairwise interaction effects on time-to-event outcomes. The method
relies on learning a low-rank factorized approximation of the interaction effects, hence overcoming the computational and statistical limitations of fitting these terms in models involving
many predictor variables. The resulting model is fully interpretable, providing access to the
estimates of both individual effects and the approximated interactions. Comprehensive evaluation of survivalFM using the UK Biobank dataset across ten disease examples and a variety
of clinical risk factors and omics data modalities shows improved discrimination and reclassification performance (65% and 97.5% of the scenarios tested, respectively). Considering
a clinical scenario of cardiovascular risk prediction using predictors from the established
QRISK3 model, we further show that the comprehensive interaction modelling adds predictive value beyond the individual and age interaction effects currently included. These results
demonstrate that comprehensive modelling of interactions can facilitate advanced insights
into disease development and improve risk predictions.
Introduction
Risk prediction models are needed in modern preventive medicine to identify individuals at high risk of
disease before clinical symptoms manifest. The ability to predict disease risk is particularly important
in managing complex diseases, such as cardiovascular disease, chronic kidney disease, and diabetes,
where early intervention can substantially alter patient outcomes. However, accurately predicting disease risk is challenging due to the inherent complexity of most human diseases, which arise from the
interplay of genetic, environmental, and lifestyle factors. Traditional methods in survival analysis,
such as the widely used Cox proportional hazards regression [1], assume linear effects of predictor
variables on time-to-event outcomes. This assumption may lead to oversimplified models that overlook the complex interplay among predictors, potentially missing important biological insights and
limiting risk prediction accuracy.
The accuracy of time-to-event prediction models can be improved by incorporating interaction
terms, a well-established concept in epidemiology to assess the joint effects of predictors on outcomes
[2, 3]. For instance, interaction terms have been shown to be relevant in cardiovascular disease (CVD)
risk prediction, where the effects of other risk factors can vary depending on age [4, 5, 6, 7]. However,
incorporating these terms in multivariable prediction models typically requires prior hypotheses about
which interactions to include. As the number of potential interaction terms increases quadratically
with the number of predictor variables in consideration, inclusion of all potential interactions quickly
becomes impractical without targeted hypotheses to guide the selection. Therefore, prior multivariable prediction models have typically been constrained to a restricted set of interaction terms known to
alter outcome associations, such as those involving age. This limits the discovery of new, potentially
relevant interactions. Another commonly employed strategy is to perform statistical testing of individual interaction terms, but this can miss interactions that only become relevant for prediction in the
presence of other variables. This challenge becomes particularly pronounced with modern biomedical
datasets, which can contain hundreds of potential predictors. While machine learning survival analysis extensions like random survival forests [8] and deep survival models [9, 10] can capture complex
non-linearities and interactions in the underlying data, they often compromise interpretability, which
is crucial when the goal is to inform clinical decision-making or to obtain insights into the risk factors
underlying disease development.
To enhance possibilities to understand and model the joint effects of risk factors on time-to-event
disease outcomes, we here present survivalFM, a methodological extension to the Cox proportional
hazards model that incorporates estimation of all potential pairwise interaction effects among predictor
variables. The method is based on an efficient strategy of learning the interaction effects using a
low-rank factorized approximation, a concept taken from factorization machines (FMs) [11] and here
applied to survival analysis. survivalFM combines the factorization of the interaction effects with an
efficient quasi-Newton optimization algorithm, thereby overcoming the computational and statistical
challenges of fitting comprehensive interaction effects in time-to-event prediction models involving
many variables. The resulting model is fully interpretable, providing access to the estimates of both
individual effects and the approximated interactions. We demonstrate the performance of survivalFM
across various data modalities and disease outcomes using data from the UK Biobank. We further
highlight an application in a clinical cardiovascular risk prediction scenario and show that survivalFM
can learn predictive interaction effects which improve identification of high-risk individuals. While
we highlight applications in disease risk prediction, the method is generally applicable to modelling
any type of time-to-event outcomes.
Results
Overview of survivalFM
Figure 1 presents an overview of survivalFM. We developed survivalFM to estimate all potential
pairwise interaction effects among input variables for right-censored survival data, such as time to
disease onset. It is based on the widely used proportional hazards model [1] which relates the time
until an event occurs to a set of predictor variables through a hazard function of the form:
\[
h(t \mid x) = h_0(t) \exp(f(x)) \tag{1}
\]
where $h(t \mid x)$ represents the hazard for an individual at time point $t$, with the baseline hazard
function $h_0(t)$ describing the time-varying hazard and the partial hazard $\exp(f(x))$ quantifying the
impact of the predictor variables $x$ on the baseline hazard. In the standard formulation of the Cox
proportional hazards model, the partial hazard $\exp(f(x))$ is assumed to be parametrized by a linear
combination of the predictor variables $f(x) = \beta^\top x$, with $\beta$ giving the weights for the individual
variables.
In many applications, understanding how variables may interact to jointly impact the hazard rate
can provide additional value beyond their independent linear effects. However, directly fitting all
potential pairwise interaction terms in a multivariable prediction model quickly becomes challenging
due to the quadratic increase in the number of interaction terms as a function of the number of input
variables. Hence, we propose survivalFM, an extension which adds an approximation of all pairwise
interaction effects using a factorized parametrization approach (Figure 1a-b):
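\[
f(x) = \beta^\top x + \sum_{1 \le i \ne j \le d} \tilde{\beta}_{i,j} \, x_i x_j
     = \beta^\top x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle \, x_i x_j \tag{2}
\]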
where $\langle \cdot, \cdot \rangle$ denotes the inner product and $d$ denotes the number of predictor variables. The first
part contains the linear effects of all predictor variables in the same way as in the standard formulation
of the Cox proportional hazards model. The second part contains all pairwise interaction effects
between the predictor variables $x_i$ and $x_j$. However, instead of directly estimating the interaction
effects $\beta_{i,j}$, the factorized parametrization approximates the effects using an inner product between
two low-rank latent vectors, $\tilde{\beta}_{i,j} = \langle p_i, p_j \rangle$. The parameter vectors $p_i \in \mathbb{R}^k$ and $p_j \in \mathbb{R}^k$ are the
row vectors of a low-rank parameter matrix $P \in \mathbb{R}^{d \times k}$ (Figure 1a). This results in far
fewer parameters to estimate, as the rank of the factorization is typically much lower than the total
number of predictor variables ($k \ll d$). With this approach, we avoid the statistical and computational
problems that would be encountered with direct estimation of all interaction terms in the presence
of many predictor variables, while still maintaining interpretability. The idea of using a factorized
parametrization originates from factorization machines (FMs) [11], originally proposed for
regression and classification tasks in the context of recommender systems. For more details of the
model and the fitting procedure, see Methods.
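As an illustration of Equation 2, the following minimal Python sketch evaluates f(x) and the partial hazard for one individual. It is not taken from the survivalFM R package: the function names, the toy dimensions and the random parameter values are assumptions made for illustration only. The sum is taken over ordered pairs i ≠ j, as the equation is written above; counting each unordered pair once would simply halve the interaction term.

```python
import math
import random


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def interaction_weight(P, i, j):
    """Approximated interaction weight <p_i, p_j> from rows i and j of P."""
    return inner(P[i], P[j])


def linear_predictor(x, beta, P):
    """f(x) = beta^T x + sum over i != j of <p_i, p_j> x_i x_j."""
    d = len(x)
    linear = inner(beta, x)
    pairwise = sum(
        interaction_weight(P, i, j) * x[i] * x[j]
        for i in range(d)
        for j in range(d)
        if i != j
    )
    return linear + pairwise


# Toy example (hypothetical values): d = 4 predictors, rank k = 2.
random.seed(0)
d, k = 4, 2
x = [random.gauss(0, 1) for _ in range(d)]
beta = [random.gauss(0, 1) for _ in range(d)]
P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(d)]

f_x = linear_predictor(x, beta, P)
partial_hazard = math.exp(f_x)  # multiplies the baseline hazard h0(t)
```

Because every interaction weight is recoverable as an inner product of two rows of P, the fitted model can be inspected term by term, which is the interpretability property described above.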
Study population, disease outcomes and data modalities
To evaluate whether survivalFM could improve risk prediction models and provide new insights on the
joint effects of risk factors on disease onset, we performed analyses using data from the UK Biobank.
This cohort comprises a total of approximately 500,000 participants from the UK, enrolled in 21 recruitment centers across the country. The UK Biobank is renowned for its comprehensive phenotyping
and molecular profiling, including routine blood biomarkers and advanced ’omics measurements such
as genomics and metabolomics. Baseline characteristics of the study population and a summary of the
datasets studied here are provided in Supplementary Table 1. As disease outcomes, we considered
the 10-year incidence of ten example diseases, selected to comprise common diseases and diseases
which can benefit from intervention if identified early (Supplementary Tables 2 and 3), excluding
participants with a prior record of the disease at baseline.
To assess the performance across different data modalities, we considered four different prediction
scenarios that incorporate an array of predictors ranging from traditional clinical predictors to more
advanced omics-based data sources (Figure 1c, Methods). In the first scenario, we started from a
set of standard cardiovascular risk factors included in the ASCVD risk estimator plus [12], widely
recognized in various primary prevention scores. Since these factors have been shown to be predictive
beyond cardiovascular diseases [13, 14, 15], we included them as standard risk factors across all
analyzed disease examples. We then added sets of more complex data layers to these standard risk
factors (Figure 1c). In the second scenario, we added a comprehensive set of hematologic and clinical
biochemistry measures to the standard risk factors; in the third scenario, we incorporated a wide
range of metabolomic biomarkers, which have recently shown promise as an assay to inform on multidisease risk
[13, 16]; and finally, we included a set of polygenic risk scores for both disease and quantitative traits
[17], which have gained interest for their potential to enhance risk prediction models by providing
complementary information to traditional risk factors [18, 19, 20].
survivalFM improves risk prediction across various diseases and data modalities
The practical utility of any risk prediction model is determined by its ability to stratify risk and identify high-risk individuals. We evaluated the ability of survivalFM to predict future disease risk and
benefit from the comprehensive interaction terms by comparing its performance to standard linear
Cox proportional hazards regression (Figure 1b), employing L2 (Ridge) regularization in both methods to control model complexity and prevent overfitting (Methods). The performance of the models
was evaluated in 10-fold cross-validation, using a 20% validation set within each cross-validation cycle
to optimize regularization parameters. Analyses were consistently applied across the same sets of
predictor variables and fixed cross-validation folds.
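The evaluation scheme described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function name and the seed are hypothetical, and we assume that the 20% validation set is drawn from the individuals outside the test fold, which the text does not state explicitly.

```python
import random


def nested_splits(n, n_folds=10, val_fraction=0.2, seed=0):
    """Yield (train, validation, test) index lists for each cross-validation fold.

    Assumption: the validation set is val_fraction of the non-test individuals.
    """
    rng = random.Random(seed)  # fixed seed gives the same folds for every model
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]  # disjoint folds
    for f in range(n_folds):
        test = folds[f]
        rest = [i for g in range(n_folds) if g != f for i in folds[g]]
        rng.shuffle(rest)
        n_val = int(round(val_fraction * len(rest)))
        # validation set tunes the regularization; test fold is used only once
        yield rest[n_val:], rest[:n_val], test


splits = list(nested_splits(n=1000))
```

Reusing the same splits for both survivalFM and the standard Cox model keeps the comparison paired, so that differences in performance are not driven by differences in the data partition.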
By modelling the comprehensive interactions present in the underlying data, survivalFM improved
the discriminatory performance across a majority of the studied examples as measured by concordance
index (C-index; Figure 2). Specifically, statistically significant improvements were noted in 26 of
the 40 evaluated scenarios (65%), with a mean improvement in concordance index (ΔC-index) of
0.005. Minor improvements were noted in another 12 of the 40 scenarios (30%; mean ΔC-index
0.001). Importantly, none of the studied examples demonstrated a statistically significant decrease in
performance with survivalFM, highlighting the robustness of the method. Absolute values for the C-indices are detailed in Supplementary Table 4, demonstrating good discriminative performance across
all models with C-indices in the range 0.72–0.93. Moreover, all models were well calibrated across
the UK Biobank cohort (Supplementary Figures 2-5).
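For readers unfamiliar with the metric, a concordance index for right-censored data can be computed as in the sketch below. The text does not specify which estimator was used; this sketch assumes Harrell's C-index, ignores tied event times, and uses a simple quadratic-time loop rather than an optimized implementation. The toy values are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data (O(n^2) reference version).

    A pair (i, j) is comparable when the individual with the shorter observed
    time had an event. It is concordant when that individual also has the
    higher predicted risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored individual cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Toy data (hypothetical): observed times, event indicators, predicted risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.8, 0.1]
c_index = concordance_index(times, events, risks)  # 4 of 5 comparable pairs concordant
```

The ΔC-index reported above corresponds to the difference between two such values computed on the same test individuals, one per model.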
Given that even modest improvements in the C-index at the population level can substantially
affect individual risk predictions, we also evaluated the model performance using continuous net reclassification improvement (NRI), which has been shown to provide complementary information on
risk model performance [21, 22]. The continuous NRI quantifies the extent to which the model appropriately increases the predicted probabilities for subjects who experience events and decreases them
for those who do not. This metric is particularly useful in the absence of established clinical thresholds
for high-risk groups, as it quantifies the improvement in risk prediction without relying on predefined
risk cutoffs and thus facilitates comparisons across different diseases.
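The definition above can be made concrete with a short sketch. It is a simplified illustration rather than the evaluation code used in the study: it assumes a binary event indicator with no censoring before the prediction horizon, and the function name and toy values are hypothetical.

```python
def continuous_nri(risk_ref, risk_new, events):
    """Continuous NRI of a new model over a reference model.

    Returns (nri, nri_events, nri_nonevents). Assumes a binary event indicator
    with no censoring before the prediction horizon.
    """
    up_e = down_e = n_e = 0
    up_ne = down_ne = n_ne = 0
    for r_ref, r_new, event in zip(risk_ref, risk_new, events):
        if event:
            n_e += 1
            up_e += r_new > r_ref      # risk correctly moved up
            down_e += r_new < r_ref    # risk incorrectly moved down
        else:
            n_ne += 1
            up_ne += r_new > r_ref     # risk incorrectly moved up
            down_ne += r_new < r_ref   # risk correctly moved down
    nri_events = (up_e - down_e) / n_e
    nri_nonevents = (down_ne - up_ne) / n_ne
    return nri_events + nri_nonevents, nri_events, nri_nonevents


# Toy data (hypothetical): three events followed by four non-events.
events = [1, 1, 1, 0, 0, 0, 0]
risk_ref = [0.2, 0.3, 0.4, 0.2, 0.3, 0.1, 0.5]
risk_new = [0.3, 0.4, 0.3, 0.1, 0.2, 0.2, 0.4]
nri, nri_events, nri_nonevents = continuous_nri(risk_ref, risk_new, events)
```

Since the event and non-event components each lie between -1 and 1, the continuous NRI ranges from -2 to 2 (that is, -200% to 200%), which is why values well above typical C-index differences are possible.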
In terms of the continuous NRI, survivalFM yielded significantly improved reclassification in
39 of the 40 studied examples (97.5%), with a mean continuous NRI of 37%. Therefore,
despite the relatively modest improvement magnitudes in the C-indices, the continuous NRI indicated
notable positive changes in individual risk predictions. For instance, type 2 diabetes modelled using
clinical biochemistry and blood counts data demonstrated the highest continuous net reclassification
improvement of 94% (95% CI 92%-96%), corresponding to 33% (95% CI 32%-35%) of events and
61% (95% CI 60%-61%) of non-events having improved risk estimates (Supplementary Figure 6).
Similarly, chronic liver disease models demonstrated notable improvements across all data modalities,
with continuous net reclassification improvements ranging between 17%-88%.
These findings suggest that the interaction terms carry additional predictive information across
various diseases and data modalities and that survivalFM can model this residual contribution. While the
extent of improvement varied depending on the specific disease and dataset under study, improvements
were consistently observed across multiple disease areas and data types.
Disease-specific interaction profiles
A key advantage of survivalFM is that despite introducing a more complex layer of non-linearity
through the interaction terms, it still maintains interpretability and transparency of how the model
predictions are made. Analysis of the estimated interaction effects revealed that in many cases there
was a diverse interaction landscape contributing to these predictions, demonstrating that the observed
performance gains are likely to stem from the cumulative benefit of many small interaction effects
rather than a few prominent ones (examples shown in Supplementary Figures 7-11). Here, we will
highlight a few examples with the most notable performance gains.
Inclusion of interaction terms was particularly advantageous in liver-related conditions, such as
when predicting alcoholic liver disease or liver fibrosis and cirrhosis using standard risk factors or
metabolomic biomarkers. In both liver disease models derived using standard risk factors, among the
most prominent interactions were those among different cholesterol measures, cholesterol-lowering
medication, and sex (Supplementary Figures 7-8). These results suggest that the joint effects of these
risk factors further explain the risk of chronic liver disease outcomes beyond their additive linear
effects. The model for alcoholic liver disease also highlighted interactions with white ethnic background, suggesting variation in risk factor profile by ethnicity. Additionally, smoking status was
highly weighted both individually and in the interactions, aligning with earlier research suggesting
that smoking may exacerbate the influence of other risk factors in the development of chronic liver
diseases [23].
In the case of liver disease models derived using metabolomics biomarkers, both alcoholic liver
disease and liver fibrosis and cirrhosis models assigned high weights to interactions among various amino acids
along with their individual effects (Supplementary Figures 9-10). These observations align with previously reported changes in amino acid metabolism related to chronic liver diseases [24, 25], with
these results suggesting that the associations of amino acids with the chronic liver disease outcomes
are also characterised by complex joint effects. Furthermore, both models emphasized a strong interaction between acetate and glutamine, with acetate having a notably pronounced interaction profile
in the model for alcoholic liver disease. Given the known roles of acetate and glutamine in alcohol
metabolism and lipid accumulation in the liver [26, 27], these findings suggest that the joint presence
of high levels of both metabolites indicates an even higher risk of chronic liver diseases.
A contrasting example was type 2 diabetes modelled using clinical biochemistry and blood counts
data, which obtained the highest observed continuous NRI. Unlike the other examples, analysis of the
model coefficients revealed that the model weights were predominantly concentrated around glycated
hemoglobin (HbA1c) and its interactions across the other variables (Supplementary Figure 11). The
highest interaction weight was attributed to the interaction between HbA1c and glucose, which was
negatively weighted despite their positive individual effects. This likely reflects the fact that the simultaneous elevation of both HbA1c and glucose does not increase risk additively but rather relates to
them being correlated measures of blood glucose regulation and overall glycemic control. Additionally, the model highlighted positively weighted interactions of HbA1c with age, white ethnicity, and
urate levels, indicating these factors together might amplify the risk. In contrast, interactions between
HbA1c and reticulocyte count and body mass index were negatively weighted.
survivalFM benefits from large training data sizes
To understand the impact of training data size on model performance and the ability of survivalFM
to leverage interaction terms, we conducted analyses with models trained on varying-sized subsets of
the training data. Throughout these analyses, the test and validation sets were held fixed, allowing
us to analyze how changes only in the number of training individuals influence model performance.
Figure 3 shows the discriminatory performance of survivalFM as a function of the number of training
individuals for the input dataset involving standard risk factors (results for the other predictor sets
are shown in Supplementary Figures 12-14). These results demonstrate a clear dependency on large
sample sizes to uncover predictive interaction terms, with survivalFM generally requiring at least
50,000 individuals in training to outperform standard Cox regression. The discriminatory performance
of survivalFM shows a positive trend and increasing gap to standard Cox regression with increasing
sample sizes, although the gains often begin to plateau at the upper end of the sample size range.
survivalFM improves prediction performance in a clinical cardiovascular risk prediction scenario
To explore whether comprehensive interaction modeling via survivalFM could also refine well-established
clinical risk prediction models, we conducted analyses in a clinical CVD risk prediction setting using
predictors from the QRISK3 model [5]. QRISK models are Cox proportional hazards models used for
predicting the patient’s 10-year risk of CVD, recommended by the healthcare guidelines in the UK.
The latest version, QRISK3 from 2017 [5], incorporates a variety of risk factors and comorbidities,
along with a set of their interaction terms with age.
We aimed to determine if comprehensive modelling of the interaction terms among the QRISK3
risk factors using survivalFM could improve the model’s ability to predict cardiovascular risk. The
endpoint was defined as 10-year incidence of composite CVD, including coronary heart disease, ischemic stroke, and transient ischemic attack, and including both fatal and non-fatal events (Supplementary Table 5, Methods). Following the exclusion criteria from the QRISK3 derivation study, we
excluded participants with prior CVD diagnoses and those on cholesterol-lowering medication at
study entry. The baseline characteristics of the study population in this clinical prediction scenario are
detailed in Supplementary Table 6.
To ensure a fair comparison of the models, we retrained the QRISK3 model in the UK Biobank
considering the same set of risk factors (Methods). As prior research has shown QRISK3 to systematically overestimate CVD risk in the UK Biobank [28], retraining the model ensures an accurate
calibration for this cohort. We evaluated three models of increasing complexity: 1) a standard Cox
regression model with linear terms only, 2) a Cox regression model incorporating linear terms and age
interaction terms from the QRISK3 model, and 3) a survivalFM model including linear terms and all
potential factorized pairwise interaction terms.
In terms of discrimination performance measured by C-index, survivalFM showed statistically
significant improvements over the compared models (Figure 4a, Supplementary Table 7). Specifically,
it improved the discrimination performance by 0.0018 (95% CI 0.0013-0.0023) over the standard Cox
model with linear terms only, and by 0.0014 (95% CI 0.0010-0.0019) over the Cox model that also
includes the current age interaction terms from QRISK3. Notably, the inclusion of age interaction terms
from QRISK3 improved the discrimination performance by 0.0004 (95% CI 0.0000-0.0008) over the
model with linear terms only. Hence, modelling the comprehensive interactions using survivalFM
yielded a gain in discrimination more than four times as large as that obtained by only incorporating
the currently included age interactions.
To further assess how well the models reclassified individuals into appropriate risk categories, we
computed categorical net reclassification improvements (NRI) at the guideline-recommended 10% absolute risk threshold [29]. Incorporating the currently included age interaction terms from QRISK3
resulted in an overall NRI of 0.66% (95% CI 0.40%-0.93%) compared to the model with linear terms
only (Figure 4b). The results for survivalFM showed a greater overall NRI of 1.47% (95% CI 1.12%-1.82%), again demonstrating further gains beyond the currently included age interaction terms. survivalFM correctly reclassified 3.18% of individuals who experienced an event into the high-risk
category, while it inappropriately reclassified a smaller portion of 1.71% of non-events as high-risk
(Supplementary Table 7). These improvements are also visible in the reclassification plots (Figure 4c)
showing how the individual predictions change with the inclusion of new model terms. All models
were well calibrated (Supplementary Figure 15a) and exhibited broadly similar distributions across
the risk spectrum (Supplementary Figure 15b).
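For reference, the categorical NRI at a single risk threshold can be written as the sum of an event and a non-event component. The notation below is not from the original text, and reading the reported 3.18% and 1.71% as net reclassification proportions is an assumption; under that reading the two components add up to the reported overall NRI for survivalFM.

```latex
\[
\mathrm{NRI}
= \underbrace{P(\text{up} \mid \text{event}) - P(\text{down} \mid \text{event})}_{\mathrm{NRI}_{\text{events}}}
+ \underbrace{P(\text{down} \mid \text{non-event}) - P(\text{up} \mid \text{non-event})}_{\mathrm{NRI}_{\text{non-events}}}
\]
\[
\mathrm{NRI} \approx 3.18\% + (-1.71\%) = 1.47\%
\]
```

Here "up" and "down" denote crossing the 10% threshold into the high-risk or the low-risk category, respectively, relative to the model with linear terms only.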
Analysis of the model coefficients from survivalFM revealed a broad array of interactions contributing to the CVD predictions. The ratio of total cholesterol to HDL cholesterol demonstrated the
most pronounced interaction profile among all predictor variables (Figure 5). This suggests that the
effect of the cholesterol ratio on CVD risk is influenced by the presence of other risk factors. For
example, the interaction weight for the cholesterol ratio with prevalent atrial fibrillation was negative,
despite both factors having positive individual weights. This suggests that these variables capture
partly overlapping aspects of cardiovascular risk. Atrial fibrillation is often associated with a broader
cardiovascular risk [30], which could already be reflected in the elevated cholesterol ratio. This may
thus imply that when both risk factors are present, they do not independently add to the risk. Comparing the estimated effects for the model terms overlapping between survivalFM and the standard Cox
regression model with linear and age interaction terms from QRISK3, the shared terms exhibited very
similar weights, with a correlation of 0.97 between the estimated effects by the two methods (Supplementary Figure 16). This shows that despite the introduction of complex interactions, the fundamental
risk associations remain broadly consistent.
Discussion
Accurate prediction of disease onset and prognosis is essential to realize preventative medicine. In this
study, we have introduced survivalFM, a new machine learning method for multivariable time-to-event
prediction. The method extends the widely used Cox proportional hazards regression by estimating
all potential pairwise interaction effects among predictor variables on time-to-event outcomes, such
as disease onset. We have shown that estimating these comprehensive interaction effects improves
risk prediction and refines individual risk predictions across a range of common diseases, providing
more nuanced insights into the interplay among factors underlying disease risk. Since survivalFM
generalizes to other use cases, we expect this method to find applications in precision medicine and
benefit survival modelling in large studies involving many predictors.
Our results from UK Biobank revealed that survivalFM can identify predictive interaction terms,
which are missed when using standard Cox proportional hazards regression. This ability to uncover
predictive interaction terms extended across various disease outcomes and data modalities. Importantly, survivalFM consistently matched or surpassed the performance of the standard Cox regression
model. This robustness is by design, as survivalFM separates linear effects from interaction effects
and, by appropriate tuning of model hyperparameters, can assign negligible weight to non-contributory
interaction effects while emphasizing predictive ones. These findings highlight the utility of survivalFM in refining risk prediction models across various prediction scenarios, including models derived from traditional clinical predictors and modern omics data types.
Our results further showed that survivalFM can add predictive value in practical clinical risk
prediction scenarios, such as in CVD risk prediction using predictors from the established QRISK3
model. CVD remains the leading cause of mortality worldwide [31], making accurate risk stratification critical for healthcare providers to allocate preventive measures effectively. Applying survivalFM to QRISK3 risk factors improved both discrimination and reclassification at the clinically
recommended 10% risk threshold, more than doubling the performance gains obtained from the current model’s age-related interaction terms alone. For context, while a recent study [18] reported a
1.3% net reclassification improvement by adding a polygenic risk score to a CVD prediction model
in a similar scenario involving QRISK3 risk factors, survivalFM achieved a comparable improvement
by optimizing the use of existing clinical variables.
A key strength of survivalFM is that despite introducing non-linearity through the comprehensive
interaction terms, it maintains interpretability by providing the estimated effects for both the individual terms and the approximated interactions. This is unlike many other advanced machine learning
techniques, which often lack transparency. Another advantage of survivalFM is a straightforward
training process, which only involves optimizing the regularization parameters and setting the rank
for factorizing the interaction parameters. We anticipate the accompanying R package will facilitate
rapid adoption of the method in other prediction studies.
Interpretation of the trained models suggested that in many cases numerous small interaction effects collectively enhanced the prediction accuracy, highlighting the importance of modeling the entire
interaction landscape. However, we also showed that capturing these interaction effects generally requires a large sample size. This can limit the method’s applicability in smaller cohorts and settings
with lower sample sizes. Therefore, future studies in adequately powered cohorts are needed to assess
the consistency of the identified interactions and gains in prediction accuracy across diverse populations. While a large sample size is needed, many biobank initiatives are emerging with clinical and
omics data at scale. Our results indicate such initiatives could be used as a base for discovering and
replicating comprehensive risk factor interactions that are missed by conventional statistical methods.
The generalizable nature of survivalFM also makes it applicable to data modalities other than those
highlighted in this paper. For instance, comprehensive modelling of interactions across omics data
modalities could provide valuable insights into the molecular interplay behind disease risk. Another
use case could be studies of protein interaction patterns in relation to disease onset. Recent studies
in UK Biobank have demonstrated the strong promise of proteomics data in predicting various diseases
[32, 33, 34]. Given that proteomics data in UK Biobank comprises around 3 000 measured proteins,
the number of potential interactions is in the millions. While the current sample size of 50 000 with
proteomics data in UK Biobank is at the lower limit for comprehensive interaction modelling, with a
sufficient sample size, survivalFM could be used to uncover protein interactions predictive of disease
onset and potentially provide further insights for personalized treatment strategies.
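The scale of the proteomics example can be checked with a few lines of arithmetic. In the sketch below, d = 3000 follows the approximate number of proteins mentioned above, whereas the rank k = 10 is a hypothetical value chosen only to illustrate the reduction; the text does not state which rank would be appropriate for such data.

```python
from math import comb


def n_pairwise_interactions(d):
    """Number of distinct predictor pairs {i, j} with i != j."""
    return comb(d, 2)


def n_factorized_parameters(d, k):
    """Number of entries in the d x k factor matrix P."""
    return d * k


d = 3000  # approximate number of measured proteins mentioned in the text
k = 10    # hypothetical rank, chosen only for illustration
direct = n_pairwise_interactions(d)         # 4,498,500 interaction terms
factorized = n_factorized_parameters(d, k)  # 30,000 factor parameters
print(f"direct pairwise terms: {direct:,}")
print(f"factorized parameters: {factorized:,}")
```

With roughly 4.5 million candidate interaction terms against about 50 000 individuals, direct estimation would be heavily overparametrized, whereas the factorized form grows only linearly in d for a fixed rank.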
In conclusion, survivalFM provides an advancement in survival analysis, enhancing disease risk
prediction by effectively incorporating comprehensive interaction terms. Our findings provide a foundation for future research and translation of risk prediction models, emphasizing the importance of
interaction effects in understanding disease development and refining risk prediction models.
Methods
survivalFM: Extending the proportional hazards model with factorized interaction terms
Survival data
Throughout this paper, we assume right-censored survival data. This means that the outcome consists
of two variables: the event of interest (here, disease onset) and the time from the beginning of the study
period until the occurrence of the event, patient loss to follow-up, or the end of follow-up, whichever
comes first (the latter two constitute right censoring). The survival dataset $D$ consists of tuples
$D = \{(x_i, t_i, \delta_i)\}_{i=1}^{N}$, where
$x_i$ represents a vector of predictor variables for the individual $i$, $t_i$ marks the observed time to the
event of interest or to the point of censoring, and $\delta_i$ is an indicator which denotes whether $t_i$
corresponds to an actual event occurrence ($\delta_i = 1$) or a censored observation ($\delta_i = 0$).
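A small sketch may help to make the data layout concrete. The class and function names, as well as the example values, are hypothetical and serve only to show how the tuple (x_i, t_i, δ_i) arises from raw follow-up information.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    x: List[float]  # predictor variables
    t: float        # observed time: event time or censoring time
    delta: int      # 1 = event observed, 0 = right-censored


def make_observation(x, event_time: Optional[float],
                     lost_time: Optional[float], horizon: float) -> Observation:
    """Build (x, t, delta) from raw follow-up information.

    event_time and lost_time are times since study entry (None if not
    applicable); horizon is the end of follow-up. The earliest one sets t.
    """
    censor_time = horizon if lost_time is None else min(lost_time, horizon)
    if event_time is not None and event_time <= censor_time:
        return Observation(x, event_time, 1)
    return Observation(x, censor_time, 0)


# Hypothetical individuals with a 10-year follow-up horizon.
dataset = [
    make_observation([0.3, 1.2], event_time=4.5, lost_time=None, horizon=10.0),
    make_observation([1.1, 0.4], event_time=None, lost_time=6.0, horizon=10.0),
    make_observation([0.7, 0.9], event_time=None, lost_time=None, horizon=10.0),
    make_observation([0.2, 2.0], event_time=12.0, lost_time=None, horizon=10.0),
]
```

The last individual has an event after the end of follow-up and is therefore recorded as censored at the horizon, which is how a 10-year incidence endpoint treats later events.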
Model formulation
We base survivalFM on the widely used proportional hazards model [1] which relates the time until
an event occurs to a set of predictor variables through a hazard function of the form:
\[
h(t \mid x) = h_0(t) \exp(f(x)) \tag{3}
\]
where $h_0(t)$ is a shared baseline hazard function that varies over time, and $\exp(f(x))$ is a partial
hazard that describes the effects of the predictor variables on the baseline hazard. In the standard
formulation of the Cox proportional hazards model, the partial hazard $\exp(f(x))$ is assumed to be
parametrized by a linear combination of the variables of the individual, $f(x) = \beta^\top x$, with $\beta$ representing the coefficients or parameters of the model assigning weights to the individual variables
$x_i$.
In this study, in addition to the individual effects of the variables, we propose to add an approximation of all pairwise interaction terms using a factorized parametrization of the coefficients, following
the approach originally introduced along with factorization machines [11] in the context of recommender systems:

$$f(x) = \beta^\top x + \sum_{1 \le i \ne j \le d} \tilde{\beta}_{i,j}\, x_i x_j = \beta^\top x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle\, x_i x_j \tag{4}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The first part contains the linear effects of the predictor
variables in the same way as in the standard formulation of the Cox proportional hazards model. The
second part contains all pairwise interactions between the predictor variables $x_i$ and $x_j$. However,
instead of directly estimating the interaction weights $\beta_{i,j}$, the factorized parametrization approximates
the coefficients using an inner product between two latent vectors, $\tilde{\beta}_{i,j} = \langle p_i, p_j \rangle$. The low-rank factor
vectors $p_i \in \mathbb{R}^k$ and $p_j \in \mathbb{R}^k$ are collected into a parameter matrix $P \in \mathbb{R}^{d \times k}$ (Figure 1a). Rank $k$ is
a hyperparameter that defines the dimensionality of the factor vectors, and usually the optimal rank of
the factorization is much lower than the number of input predictors ($k \ll d$).
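The factorized log-risk function can be sketched in a few lines of Python. The sketch below is illustrative only and is not taken from the survivalFM package: it assumes that each unordered pair of predictors is counted once (i < j), as in the factorization machines formulation [11], and all function and variable names are hypothetical. It compares a direct double loop over pairs with the equivalent reformulation from [11] that needs only O(dk) operations.

```python
def log_risk_naive(x, beta, P):
    """f(x) with an explicit loop over all unordered predictor pairs (i < j)."""
    d = len(x)
    value = sum(b * xi for b, xi in zip(beta, x))
    for i in range(d):
        for j in range(i + 1, d):
            inner = sum(pi * pj for pi, pj in zip(P[i], P[j]))  # <p_i, p_j>
            value += inner * x[i] * x[j]
    return value


def log_risk_fast(x, beta, P):
    """Same quantity via 0.5 * sum_f [(sum_i p_if x_i)^2 - sum_i (p_if x_i)^2]."""
    d, k = len(x), len(P[0])
    value = sum(b * xi for b, xi in zip(beta, x))
    for f in range(k):
        s = sum(P[i][f] * x[i] for i in range(d))
        sq = sum((P[i][f] * x[i]) ** 2 for i in range(d))
        value += 0.5 * (s * s - sq)
    return value


# Toy example with d = 3 predictors and rank k = 2 (arbitrary numbers).
x = [1.0, 2.0, 3.0]
beta = [0.1, -0.2, 0.3]
P = [[0.5, 0.1], [0.2, -0.3], [-0.4, 0.6]]
naive_value = log_risk_naive(x, beta, P)
fast_value = log_risk_fast(x, beta, P)
```

Both functions return the same value for the toy input; the second form avoids ever materializing the d × d matrix of interaction weights.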
Parameter estimation
Following the standard Cox proportional hazards regression, we estimate the model parameters $\theta$
using a partial likelihood function $L(\theta|D)$. For each individual who experiences an event at time $t$,
their likelihood contribution is the ratio of the hazard of that individual to the sum of the hazards of
all individuals at risk at that time point. The partial likelihood is the product of these contributions
across all individuals with an event occurrence. Formally, this can be expressed as follows:

$$L(\theta|D) = \prod_{i:\delta_i=1} \frac{h_0(t_i) \exp(f(x_i))}{\sum_{j \in R(t_i)} h_0(t_i) \exp(f(x_j))} = \prod_{i:\delta_i=1} \frac{\exp(f(x_i))}{\sum_{j \in R(t_i)} \exp(f(x_j))} \tag{5}$$

where xi denotes the vector of predictor variables for an individual i, ti is the observed event time
for individual i and R(ti ) denotes the risk set at time ti . Being in the risk set essentially means that
the individual has not had an event yet or that their censoring date has not passed yet. Here, f (x)
corresponds to the log-risk function from eq. (4) containing the individual effects and all pairwise
interaction terms in a factorized form. As the baseline hazard function h0 (t) is assumed to be shared
across all individuals, it cancels out when calculating the partial likelihood, hence eliminating the need
for its specification, a key feature of the Cox proportional hazards model rendering it semi-parametric.
To find the optimal parameters θ = {β, P}, instead of maximizing the partial likelihood, one can
equivalently minimize the negative log-likelihood to obtain a more convenient formulation. Taking
the logarithm of the partial likelihood function yields a log-likelihood function of the form:

$$l(\theta|D) = \log\left(\prod_{i:\delta_i=1} \frac{\exp(f(x_i))}{\sum_{j \in R(t_i)} \exp(f(x_j))}\right) = \sum_{i:\delta_i=1} \left(f(x_i) - \log\left(\sum_{j \in R(t_i)} \exp(f(x_j))\right)\right) \tag{6}$$

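The log partial likelihood can be evaluated directly from the log-risk scores. The following is a minimal sketch, not the implementation used in this study: it assumes that the risk set at an event time contains every individual whose observed time is at least that event time, it handles tied event times in the simplest (Breslow-type) way, and the function name is hypothetical.

```python
import math


def partial_log_likelihood(scores, times, events):
    """Log partial likelihood given scores[i] = f(x_i), observed times and event flags."""
    total = 0.0
    for i, (t_i, delta_i) in enumerate(zip(times, events)):
        if delta_i != 1:
            continue  # censored individuals contribute only through risk sets
        # Assumed risk set: everyone still under observation at time t_i.
        risk_scores = [scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        # Numerically stable log-sum-exp over the risk set.
        m = max(risk_scores)
        log_sum = m + math.log(sum(math.exp(s - m) for s in risk_scores))
        total += scores[i] - log_sum
    return total
```

With all scores equal, every event contributes minus the logarithm of the size of its risk set, which gives a quick sanity check of the risk-set bookkeeping.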
To overcome overfitting in scenarios involving many predictor variables, one can include regularization terms. Here, we consider L2 regularization (Ridge). Hence, the regularized learning problem
is given by:

$$\underset{\beta, P}{\arg\min} \; -\frac{2}{n}\, l(\theta|D) + \lambda_1 \|\beta\|_2^2 + \lambda_2 \|P\|_2^2 \tag{7}$$

where λ1 and λ2 are the regularization parameters for the individual effects and the factorized interactions, respectively. Using separate regularization parameters for the individual effects and the
interactions allows for individual penalization of these two parts. The log-likelihood is scaled by a
factor of 2/n for convenience and to follow the definition from the popular glmnet R package [35],
used for the standard Cox regression comparison in this study.
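The gradient of the log-likelihood function $l(\theta|D)$ with respect to the model parameters
$\theta = \{\beta, P\}$ is given by:

$$\frac{\partial}{\partial\theta}\, l(\theta) = \frac{\partial}{\partial\theta} \sum_{i:\delta_i=1} \left(f(x_i|\theta) - \log\left(\sum_{j \in R(t_i)} \exp(f(x_j|\theta))\right)\right) \tag{8}$$

$$= \sum_{i:\delta_i=1} \left(\frac{\partial f(x_i|\theta)}{\partial\theta} - \frac{\sum_{j \in R(t_i)} \exp(f(x_j|\theta))\, \frac{\partial f(x_j|\theta)}{\partial\theta}}{\sum_{j \in R(t_i)} \exp(f(x_j|\theta))}\right) \tag{9}$$

where

$$\frac{\partial f(x|\theta)}{\partial\theta} = \begin{cases} x_i & \text{if } \theta \text{ is } \beta_i \\ x_i \sum_{j=1}^{d} p_{j,f}\, x_j - p_{i,f}\, x_i^2 & \text{if } \theta \text{ is } p_{i,f} \end{cases} \tag{10}$$

Here $x_i$ denotes the $i$-th predictor variable and $p_{i,f}$ the $f$-th element of the factor vector $p_i$.
The sum $\sum_{j=1}^{d} p_{j,f}\, x_j$ is independent of $i$ and thus can be precomputed [11]. In addition, the
gradients of the L2 regularization terms are given by $\frac{\partial}{\partial\theta} \lambda \|\theta\|_2^2 = 2\lambda\theta$.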
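The partial derivatives of the log-risk function in eq. (10) can be checked numerically. The sketch below is illustrative and not taken from the survivalFM package: it assumes that each unordered pair of predictors is counted once in f(x), which is the reading under which eq. (10) holds exactly, and the names f_value and f_gradient are hypothetical.

```python
def f_value(x, beta, P):
    """Log-risk f(x) with factorized pairwise interactions (pairs counted once)."""
    d, k = len(x), len(P[0])
    value = sum(beta[i] * x[i] for i in range(d))
    for f in range(k):
        s = sum(P[i][f] * x[i] for i in range(d))
        sq = sum((P[i][f] * x[i]) ** 2 for i in range(d))
        value += 0.5 * (s * s - sq)
    return value


def f_gradient(x, beta, P):
    """Partial derivatives of f(x) with respect to beta and P, following eq. (10)."""
    d, k = len(x), len(P[0])
    # Precompute sum_j p_jf x_j once per factor dimension f.
    sums = [sum(P[j][f] * x[j] for j in range(d)) for f in range(k)]
    grad_beta = list(x)
    grad_P = [[x[i] * sums[f] - P[i][f] * x[i] ** 2 for f in range(k)]
              for i in range(d)]
    return grad_beta, grad_P


# Toy example with d = 3 predictors and rank k = 2 (arbitrary numbers).
x = [1.0, 2.0, 3.0]
beta = [0.1, -0.2, 0.3]
P = [[0.5, 0.1], [0.2, -0.3], [-0.4, 0.6]]
grad_beta, grad_P = f_gradient(x, beta, P)
```

Comparing an entry of grad_P against a central finite difference of f_value is a simple way to validate the analytic expression before plugging it into eq. (9).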
To solve (7), we use an efficient BFGS (Broyden–Fletcher–Goldfarb–Shanno) quasi-Newton algorithm [36, 37, 38, 39], as implemented in the base R stats package [40]. In contrast to the standard
Newton-Raphson method, the BFGS algorithm uses an approximation of the Hessian to determine the
search direction. Due to the factorization of the interaction parameters, the number of estimated parameters remains moderate even in the presence of many predictor variables, making the computation
of the Hessian approximation feasible. Empirical evidence from our analyses indicated that alternative stochastic gradient descent (SGD)-based optimization methods, commonly employed in machine
learning, were not as effective here.
Study population
The UK Biobank is a comprehensive prospective cohort study serving as a major globally available
health research resource. It includes data from approximately half a million participants aged 37-73,
representing a sample from the general UK population. The participants were recruited through 22
assessment centers throughout England, Wales, and Scotland between 2006 and 2010. The follow-up is still ongoing. Further details of the study protocol and data collection are available online
(https://www.ukbiobank.ac.uk/media/gnkeyh2q/study-rationale.pdf) and in the literature [41]. The
UK Biobank study was approved by the North West Multi-Centre Research Ethics Committee and
all participants provided written informed consent. In this study, the data was accessed under UK
Biobank project ID 147811.
Predictor variable sets
Standard risk factors As standard risk factors, we included predictors from the ASCVD risk estimator plus [42, 12], which are also commonly featured in other primary prevention tools. These
demographic and cardiovascular risk factors have been shown to be predictive of diseases beyond
CVD [13, 14, 15]. These included age, sex, ethnic background, systolic and diastolic blood pressure, total, HDL, and LDL cholesterol, smoking status, prevalent type 2 diabetes (excluded from the
analyses related to type 2 diabetes), hypertension, and cholesterol-lowering treatment, further detailed
in Supplementary Table 8. This data was extracted from the data collected at the study’s initial recruitment visit. Prevalent diabetes status was extracted from primary care records, hospital episode
statistics, and self-reported conditions during the initial assessment. These standard risk factors were
included in all models trained.
Clinical biochemistry and blood counts A comprehensive set of clinical biochemistry measures
was provided by UK Biobank for blood samples taken at the initial recruitment visit and has been
previously described in the literature [43, 44]. These included hematologic markers (complete blood
counts, white blood cell populations and reticulocytes) and a wide range of blood biochemistry measures covering established risk factors, diagnostic biomarkers and other characterisation of phenotypes,
such as measures for renal and liver function. Nucleated blood cell counts were excluded from our
analyses because over 99% of the cohort had these recorded as missing or zero. We also excluded
estradiol, rheumatoid factor and lipoprotein (a), because a large portion of the cohort (>20%) had
these recorded as missing or under the limit of detection. The blood sample handling and storage
protocol has been previously described in the literature [45]. A complete list of the included variables
is provided in Supplementary Table 9.
Metabolomics biomarkers The metabolomics data included 168 lipids and metabolites from a
high-throughput NMR metabolomics assay, available for the baseline blood samples from approximately 275,000 individuals in the UK Biobank. The metabolite data covers a wide range of small
molecules, such as amino acids, inflammation markers and ketones, as well as lipids, lipoproteins and
fatty acids. Percentage ratios calculated from these 168 original measures were excluded from our
analyses. Details of the metabolite data have been previously described [16]. A complete list of the
included metabolomics biomarkers is included in Supplementary Table 10.
Polygenic risk scores The polygenic risk score data included 53 polygenic risk scores (PRS) released by the UK Biobank and described in [17]. These included scores for both disease traits and
quantitative traits. In our analyses, we included only the standard PRS set obtained entirely from
external genome-wide association study (GWAS) data. As provided in the UK Biobank, the score
distributions were already centered at zero across all ancestries using a principal component-based
ancestry centering step. A complete list of the included variables is provided in Supplementary Table
11.
QRISK3 risk factors We matched the risk factors from the QRISK3 model with the corresponding
variables available in the UK Biobank. These variables were gathered during the baseline assessment
visit and included cholesterol levels measured from blood samples and prevalent disease diagnoses
obtained from linked hospital records, primary care data, and self-reported conditions. In instances
where an exact match for a QRISK3 model risk factor was unavailable in the UK Biobank, the closest
equivalent field was utilized. A complete list of the predictors and their corresponding UK Biobank
fields is provided in Supplementary Table 12.
Disease endpoint definitions
For the highlighted examples across different data modalities and 10-year incidence of ten different
disease outcomes, each of the outcomes was defined by the earliest occurrence in primary care, hospital episode statistics or death records, using the first occurrences data field from UK Biobank (category
1712). For lung cancer, we additionally included data from the cancer registry. The endpoints were
defined based on 3-character ICD-10 codes, detailed in Supplementary Table 2. Participants with a
previous diagnosis of the disease under study were excluded from the analysis of each endpoint.
The analysis of QRISK3 predictors focused on a 10-year composite CVD outcome, defined according to the original QRISK3 derivation study [5], including coronary heart disease, ischemic stroke,
and transient ischemic attack. The ICD-10 codes used are detailed in Supplementary Table 5. We used
the earliest recorded date of cardiovascular disease in any of the three data sources (primary care, hospital episode statistics and death records) as the outcome date, using the first occurrences data field
from UK Biobank (category 1712). Participants with a prior CVD diagnosis and those on
cholesterol-lowering medication at the start of the study were excluded from the analyses, following the exclusion
criteria of the original QRISK3 derivation study [5].
Data partitions and preprocessing
Model training and testing were performed using a 10-fold cross-validation approach. In each
cross-validation cycle, one of the 10 folds at a time was set aside as a test set, aggregating the remaining
partitions to form a training set. From this training set, we randomly selected 20% to use as the
validation set. Within each of the 10 cross-validation loops, the individual test set remained untouched
throughout model development and the validation set was used for model hyperparameter selection.
After selecting the optimal hyperparameters, the validation and training sets were combined to train
the final model. All 10 obtained models were then evaluated on their respective test sets, and the
results from the test sets were aggregated for the final evaluation.
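The partitioning scheme described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the fold assignment, the random seed and the function name are assumptions, and only index bookkeeping is shown.

```python
import random


def cross_validation_splits(n, n_folds=10, val_fraction=0.2, seed=0):
    """Yield train/validation/test index lists for each of the n_folds cycles."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]  # held out for the whole cycle
        rest = [i for g in range(n_folds) if g != f for i in folds[g]]
        rng.shuffle(rest)
        n_val = int(round(val_fraction * len(rest)))
        # Validation part is used for hyperparameter selection only; afterwards
        # train + val would be merged to fit the final model for this cycle.
        yield {"train": rest[n_val:], "val": rest[:n_val], "test": test}


splits = list(cross_validation_splits(100))
```

Each individual appears in exactly one test set across the ten cycles, so the aggregated test predictions cover the whole cohort once.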
For data preprocessing, log-normally distributed continuous variables (concerning clinical biochemistry markers, blood counts and metabolomics biomarkers) were log1p-transformed (i.e. taking
the logarithm of the given value plus one). Outliers exceeding 4 standard deviations from the mean
were winsorized. Continuous variables were scaled to zero mean and unit variance and categorical
variables were one-hot encoded. The means and standard deviations used for scaling were calculated
from the training set and subsequently applied to the validation and test sets. To maximize sample size
for the model training, missing values were imputed within the training set using k-nearest neighbors
(kNN) imputation (k = 10). To ensure no data leakage between the training, validation and test sets,
imputation was exclusively performed within the training set, sparing the validation and test sets from
the imputations. Hence, for the validation and test sets, only individuals with complete data available
were included.
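The transformation, winsorization and scaling steps for a single continuous variable can be sketched as below. Several details are assumptions made for illustration, since the text does not fix them: the winsorization bounds and the scaling statistics are both computed from the log-transformed training values, the sample standard deviation is used, and the function names are hypothetical. The kNN imputation step is omitted.

```python
import math


def fit_scaler(train_values, n_sd=4.0):
    """Estimate mean and standard deviation on log1p-transformed training values."""
    logged = [math.log1p(v) for v in train_values]
    mean = sum(logged) / len(logged)
    sd = math.sqrt(sum((v - mean) ** 2 for v in logged) / (len(logged) - 1))
    return {"mean": mean, "sd": sd, "n_sd": n_sd}


def transform(values, scaler):
    """Apply log1p, winsorize at mean +/- n_sd * sd, then scale to z-scores."""
    mean, sd, n_sd = scaler["mean"], scaler["sd"], scaler["n_sd"]
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    out = []
    for v in values:
        z = min(max(math.log1p(v), lower), upper)  # winsorize outliers
        out.append((z - mean) / sd)
    return out


train = [1.0, 2.0, 3.0, 4.0, 5.0]
scaler = fit_scaler(train)            # statistics from the training set only
train_scaled = transform(train, scaler)
test_scaled = transform([1e12], scaler)  # an extreme value in a held-out set
```

Fitting the statistics on the training set and reusing them for the validation and test sets mirrors the leakage-avoidance principle described above.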
Model hyperparameter tuning
In both the standard linear Cox regression and our proposed survivalFM method, we employed L2
(ridge) regularization to control model complexity and prevent overfitting. This requires tuning the
regularization parameter λ. For survivalFM, we allowed differing regularization strengths for the linear (λ1 ) and the interaction part (λ2 ), to separately control the influence of main effects and interaction
effects. All regularization parameters were optimized by considering a series of equally spaced values on a logarithmic scale between $10^{-4}$ and 1. In addition, survivalFM requires setting the rank of the
factorization (k) for the interaction parameters, which was here set to k = 10.
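A grid of this kind can be generated as in the sketch below. It assumes that the lower end of the range is 10^-4 and uses five grid points per parameter purely for illustration, since the number of candidate values is not stated here; the function name is hypothetical.

```python
import itertools
import math


def log_spaced_grid(low=1e-4, high=1.0, num=5):
    """Return num values equally spaced on a log10 scale between low and high."""
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


grid = log_spaced_grid()
# Candidate (lambda_1, lambda_2) pairs for the linear and interaction parts.
candidate_pairs = list(itertools.product(grid, grid))
```

Because the two penalties are tuned separately, the number of candidate settings grows with the square of the grid size, which is worth keeping in mind when choosing the grid resolution.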
Analysis of model performance
The standard linear Cox regression models used for the comparisons were trained using the glmnet
[46, 35] R package. Concordance indices (Harrell's C-index) were computed using the R package
survival [47] and net reclassification improvements using the R package nricens [48].
Confidence intervals for all metrics were calculated with 1000 bootstrapping iterations. Statistical
inference on performance differences was based on the bootstrap distributions of the difference
metrics: performances were considered statistically significantly different when the 95% confidence
interval of the difference did not include zero. All analyses were performed using R version 4.3.1 [40].
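The bootstrap comparison can be illustrated with a small self-contained sketch. It is not the code used in the study, where the R packages named above were employed: the concordance function is a simplified pairwise implementation, the percentile method for the interval is an assumption, and all names are hypothetical.

```python
import random


def c_index(times, events, risks):
    """Simplified concordance: share of comparable pairs ranked correctly."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:  # i had the event before j's observed time
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def bootstrap_difference_ci(times, events, risks_a, risks_b,
                            n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for the C-index difference (model A minus model B)."""
    rng = random.Random(seed)
    n = len(times)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        t = [times[i] for i in idx]
        e = [events[i] for i in idx]
        a = [risks_a[i] for i in idx]
        b = [risks_b[i] for i in idx]
        try:
            diffs.append(c_index(t, e, a) - c_index(t, e, b))
        except ZeroDivisionError:
            continue  # resample had no comparable pairs; draw again
    diffs.sort()
    lower = diffs[int(round((alpha / 2) * n_boot))]
    upper = diffs[int(round((1 - alpha / 2) * n_boot)) - 1]
    return lower, upper


# Toy data: model A ranks everyone correctly, model B ranks in reverse.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [1, 1, 0, 1, 1, 1]
risks_a = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
risks_b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ci = bootstrap_difference_ci(times, events, risks_a, risks_b, n_boot=200)
significant = ci[0] > 0 or ci[1] < 0  # interval excludes zero
```

Resampling individuals jointly for both models keeps the comparison paired, so the interval reflects the variability of the difference rather than of each metric separately.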
Code availability
The method developed in this study has been made available as an R-package and can be installed
from: https://github.com/aalto-ics-kepaco/survivalfm.
Data availability
UK Biobank data are available to researchers upon application at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.
Author contributions
H.J. conceived the idea and designed the method with input from J.R. H.J. processed the data, wrote
the code, performed the analyses, prepared the figures and wrote the manuscript. J.R. supervised the
work and contributed to writing the manuscript. Both authors read and approved the final manuscript.
Acknowledgments
This work was supported by the Research Council of Finland grants 339421 (Machine Learning for
Systems Pharmacology, MASF, 2021-2025) and 345802 (AI technologies for interaction prediction
in biomedicine, AIB, 2022-2024). The authors acknowledge the computational resources provided by
the Aalto Science-IT project.
References
[1] Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society: Series B
(Methodological) 34, 187–202 (1972).
[2] Harrell Jr, F. E., Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing
models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in
medicine 15, 361–387 (1996).
[3] Corraini, P., Olsen, M., Pedersen, L., Dekkers, O. M. & Vandenbroucke, J. P. Effect modification,
interaction and mediation: an overview of theoretical insights for clinical investigators. Clinical
Epidemiology 331–338 (2017).
[4] Prospective Studies Collaboration. Age-specific relevance of usual blood pressure to vascular
mortality: a meta-analysis of individual data for one million adults in 61 prospective studies.
The Lancet 360, 1903–1913 (2002).
[5] Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk
prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study.
BMJ 357 (2017).
[6] SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction
algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. European
Heart Journal 42, 2439–2454 (2021).
[7] Kaptoge, S. et al. World health organization cardiovascular disease risk charts: revised models
to estimate risk in 21 global regions. The Lancet Global Health 7, e1332–e1345 (2019).
[8] Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, M. S. Random survival forests. The
Annals of Applied Statistics 2, 841–860 (2008).
[9] Katzman, J. L. et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology 18, 1–12 (2018).
[10] Nagpal, C., Li, X. & Dubrawski, A. Deep survival machines: Fully parametric survival regression and representation learning for censored data with competing risks. IEEE Journal of
Biomedical and Health Informatics 25, 3163–3175 (2021).
[11] Rendle, S. Factorization machines. In 2010 IEEE International conference on data mining,
995–1000 (IEEE, 2010).
[12] American College of Cardiology. ASCVD Risk Predictor Plus (2020).
https://tools.acc.org/ASCVD-Risk-Estimator-Plus. Date accessed: 2024-04-30.
[13] Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nature
Medicine 28, 2309–2320 (2022).
[14] de Bruijn, R. F. & Ikram, M. A. Cardiovascular risk factors and future risk of Alzheimer’s
disease. BMC medicine 12, 1–9 (2014).
[15] Koene, R. J., Prizment, A. E., Blaes, A. & Konety, S. H. Shared risk factors in cardiovascular
disease and cancer. Circulation 133, 1104–1114 (2016).
[16] Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nature Communications 14, 604 (2023).
[17] Thompson, D. J. et al. UK Biobank release and systematic evaluation of optimised polygenic
risk scores for 53 diseases and quantitative traits. MedRxiv 2022–06 (2022).
[18] Elliott, J. et al. Predictive accuracy of a polygenic risk score–enhanced prediction model vs a
clinical risk score for coronary artery disease. JAMA 323, 636–645 (2020).
[19] Mars, N. et al. Polygenic and clinical risk scores and their impact on age at onset and prediction
of cardiometabolic diseases and common cancers. Nature Medicine 26, 549–557 (2020).
[20] Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology 72, 1883–1893
(2018).
[21] Pencina, M. J., D’Agostino Sr, R. B., D’Agostino Jr, R. B. & Vasan, R. S. Evaluating the added
predictive ability of a new marker: from area under the ROC curve to reclassification and beyond.
Statistics in Medicine 27, 157–172 (2008).
[22] Pencina, M. J., D’Agostino Sr, R. B. & Steyerberg, E. W. Extensions of net reclassification
improvement calculations to measure usefulness of new biomarkers. Statistics in Medicine 30,
11–21 (2011).
[23] Altamirano, J. & Bataller, R. Cigarette smoking and chronic liver diseases. Gut 59, 1159–1162
(2010).
[24] Sato, S. et al. Elevated serum tyrosine concentration is associated with a poor prognosis among
patients with liver cirrhosis. Hepatology Research 51, 786–795 (2021).
[25] Morgan, M. Y., Marshall, A., Milsom, J. P. & Sherlock, S. Plasma amino-acid patterns in liver
disease. Gut 23, 362–370 (1982).
[26] Sunami, Y. NASH, fibrosis and hepatocellular carcinoma: Lipid synthesis and glutamine/acetate
signaling. International Journal of Molecular Sciences 21, 6799 (2020).
[27] Zakhari, S. Overview: how is alcohol metabolized by the body? Alcohol research & health 29,
245 (2006).
[28] Parsons, R. E. et al. Independent external validation of the QRISK3 cardiovascular disease risk
prediction model using UK Biobank. Heart 109, 1690–1697 (2023).
[29] National Institute for Health and Care Excellence. Cardiovascular disease: risk assessment and reduction, including lipid modification (NICE guideline [NG238]) (2023).
https://www.nice.org.uk/guidance/ng238/chapter/Recommendationsstatins-for-primaryprevention-of-cardiovascular-disease. Date accessed: 2024-04-30.
[30] Odutayo, A. et al. Atrial fibrillation and risks of cardiovascular disease, renal disease, and death:
systematic review and meta-analysis. BMJ 354 (2016).
[31] World Health Organization. WHO reveals leading causes of death and disability worldwide:
2000-2019. World Health Organization (WHO) 1 (2020).
[32] Gadd, D. A. et al. Blood protein assessment of leading incident diseases and mortality in the UK
Biobank. Nature Aging 1–10 (2024).
[33] You, J. et al. Plasma proteomic profiles predict individual future health risk. Nature Communications 14, 7817 (2023).
[34] Sun, B. B. et al. Plasma proteomic associations with genetics and health in the UK Biobank.
Nature 622, 329–338 (2023).
[35] Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox’s proportional
hazards model via coordinate descent. Journal of Statistical Software 39, 1 (2011).
[36] Broyden, C. G. The convergence of a class of double-rank minimization algorithms 1. general
considerations. IMA Journal of Applied Mathematics 6, 76–90 (1970).
[37] Fletcher, R. A new approach to variable metric algorithms. The Computer Journal 13, 317–322
(1970).
[38] Goldfarb, D. A family of variable metric updates derived by variational means, v. 24. Mathematics of Computation 21–55 (1970).
[39] Shanno, D. F. Conditioning of quasi-Newton methods for function minimization. Mathematics
of Computation 24, 647–656 (1970).
[40] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria (2023).
[41] Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank
participants with those of the general population. American Journal of Epidemiology 186, 1026–
1034 (2017).
[42] Arnett, D. K. et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular
disease: a report of the American College of Cardiology/American Heart Association Task Force
on Clinical Practice Guidelines. Circulation 140, e596–e646 (2019).
[43] Allen, N. E. et al. Approaches to minimising the epidemiological impact of sources of systematic
and random variation that may affect biochemistry assay data in UK Biobank. Wellcome Open
Research 5 (2020).
[44] Watts, E. L. et al. Hematologic markers and prostate cancer risk: a prospective analysis in UK
Biobank. Cancer Epidemiology, Biomarkers & Prevention 29, 1615–1626 (2020).
[45] Elliott, P. & Peakman, T. C. The UK Biobank sample handling and storage protocol for the
collection, processing and archiving of human blood and urine. International Journal of Epidemiology 37, 234–244 (2008).
[46] Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via
coordinate descent. Journal of Statistical Software 33, 1 (2010).
[47] Therneau, T. M. A package for survival analysis in R (2024). R package version 3.7-0.
[48] Inoue, E. nricens: NRI for risk prediction models with time to event and binary response data.
(2018). R package version 1.6.
[Figure 1 graphic. Panel a: Comprehensive interaction modelling by machine learning: survivalFM (linear effects β ∈ R^d; all potential interaction effects β ∈ R^(d×d); factorized parametrization with low-rank parameter matrix of the factor vectors P ∈ R^(d×k)). Panel b: Method evaluation (standard Cox regression vs. survivalFM). Panel c: Disease prediction examples in the UK Biobank: i) case studies with various data modalities (10-year onset of 10 different diseases; 4 scenarios of predictors: standard risk factors alone, or combined with biochemistry and blood counts, metabolomics biomarkers, or polygenic risk scores); ii) clinical example, QRISK3 (10-year onset of cardiovascular disease; established predictors from QRISK3: demographic characteristics, blood pressure, body mass index, prevalent diseases, medications, cholesterol measures).]
Figure 1: Method overview and evaluation examples. a) A machine learning survival
analysis method, survivalFM, is developed to estimate linear and all pairwise interaction effects between predictor variables using a factorized parametrization of the interaction terms
$\beta_{i,j} \approx \langle p_i, p_j \rangle$. $d$ denotes the number of predictor variables and $k$ is a hyperparameter defining the rank of the factorization of the interaction terms. The rank of the factorization is
typically much lower than the number of predictor variables ($k \ll d$), enabling computation
of the interaction terms even in the presence of many input variables. b) The added value of
incorporating comprehensive interaction terms using survivalFM is assessed by comparing
the performance to the standard linear Cox proportional hazards regression. c) The performance is evaluated in various disease prediction examples: i) case studies with four different
predictor sets, each applied to ten disease examples; ii) a clinical example using predictors
from the QRISK3 cardiovascular disease (CVD) risk evaluation tool.
Figure 2: Comprehensive interaction modelling by survivalFM improves risk prediction performance across various diseases and data modalities. Comparison of the predictive performance of survivalFM to standard linear Cox proportional hazards regression
in terms of the difference in concordance index (Δ C-index) and continuous net reclassification
improvement (NRI). Results are shown for ten disease examples (y-axis) across four data
modalities: a) standard risk factors (blue; included in all models), b) clinical biochemistry
and blood counts (red), c) metabolomics biomarkers (orange) and d) polygenic risk scores
(green). Horizontal error bars denote 95% confidence intervals (CIs), estimated with bootstrapping over 1000 resamples. Sample sizes and event counts for each disease example
are provided in Supplementary Table 3.
[Figure 3 graphic. Ten panels, one per disease: alcoholic liver disease, Alzheimer's disease, chronic kidney disease, liver fibrosis & cirrhosis, lung cancer, myocardial infarction, osteoporosis, stroke, type 2 diabetes, vascular and other dementia. x-axis: number of individuals in training (0 to 300,000); y-axis: C-index; series: survivalFM and standard Cox regression.]
Figure 3: Comprehensive interaction modelling using survivalFM benefits from large
training data sizes. Impact of the size of the training dataset (x-axis) on the discrimination
performance as measured by concordance index (C-index; y-axis), comparing survivalFM
(blue) to standard Cox regression (gray). Results are shown for the input dataset consisting
of standard risk factors. Sample sizes and event counts for each example are provided in
Supplementary Table 3.
Figure 4: Evaluation of survivalFM in a practical clinical cardiovascular risk prediction
scenario involving predictors from QRISK3. Performance of the models trained considering QRISK3 predictors for composite cardiovascular disease prediction (N = 344 292 with
complete data, 21 534 events). a) Discrimination performance evaluated using concordance
index (C-index) for three models: standard Cox regression with linear terms, standard Cox
regression with linear terms + age interaction terms from QRISK3 and survivalFM model
with linear terms + all factorized pairwise interactions. b) Categorical net reclassification improvements (NRI) at 10% absolute risk threshold, as compared to standard Cox model with
linear terms. Horizontal error bars denote 95% confidence intervals (CIs), estimated with
bootstrapping over 1000 resamples. c) Reclassification plots showing how the inclusion of
interaction terms in the more advanced models (y-axis, logarithmic scale) changes individual risk predictions, as compared to a standard linear Cox model (x-axis, logarithmic scale).
Black dotted vertical and horizontal lines show the 10% absolute risk threshold for the high
risk category.
Figure 5: Estimated model coefficients from survivalFM model, trained considering
the risk factors from QRISK3 for cardiovascular disease risk prediction. The coefficients are shown as the average of the estimated coefficients across the ten models trained
during the cross-validation. Estimated coefficients for a) the linear effects $\beta$ and b) the interaction effects given by the inner product of the factor vectors, $\beta_{i,j} = \langle p_i, p_j \rangle$. The dendrogram
shows a hierarchical clustering of the interaction profiles, using Euclidean distance as the
measure of similarity.