Machine Learning for Precision Medicine Doctoral Thesis

Machine Learning for Precision Medicine
Heli Julkunen

Aalto University publication series, Doctoral Theses 8/2025

A doctoral thesis completed for the degree of Doctor of Science to be defended, with the permission of the Aalto University School of Science, at a public examination held at the lecture hall T2 of the school on 31 January 2025 at 12 noon.

Aalto University
School of Science
Department of Computer Science

Supervising professor: Prof. Juho Rousu, Aalto University, Finland
Preliminary examiners: Prof. Ron Do, Icahn School of Medicine at Mount Sinai, United States; Dr. Taru Tukiainen, Institute for Molecular Medicine Finland (FIMM), Finland
Opponent: Prof. Maik Pietzner, Precision Healthcare University Research Institute, Queen Mary University of London, United Kingdom, and Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Germany

© Heli Julkunen
ISBN 978-952-64-2351-7 (paperback)
ISBN 978-952-64-2352-4 (pdf)
ISSN 1799-4934 (paperback)
ISSN 1799-4942 (pdf)
http://urn.fi/URN:ISBN:978-952-64-2352-4
Unigrafia Oy, Helsinki 2025

Pages: 200
Keywords: precision medicine, machine learning, predictive modelling, survival analysis, risk prediction, metabolomics, drug combinations
Finnish title: Koneoppimisratkaisuja täsmälääketieteeseen (Machine learning solutions for precision medicine)
Finnish keywords (translated): precision medicine, machine learning, predictive modelling, survival analysis, risk prediction, metabolomics, drug combinations

Acknowledgements

This doctoral thesis marks the culmination of an incredible journey of research, learning, and growth. It has been shaped by the support and expertise of many remarkable individuals and institutions, to whom I am deeply grateful.

First and foremost, I would like to express my heartfelt thanks to my supervising professor, Juho Rousu, for believing in me and my abilities from the very beginning. Your guidance, encouragement, and leadership have been invaluable during this academic journey. From my first steps as a research intern in your group to this point, your mentorship has played a pivotal role in my development as a scientist, and your expertise and leadership have left an undeniable mark on this work.

I extend my deepest gratitude to all my co-authors.
My special thanks go to Anna Cichońska, the second author on most of the publications in this work. You introduced me to the world of scientific research during my early days as an undergraduate student at Aalto University and continued to guide me through my growth as a researcher and later as a colleague at Nightingale Health. Beyond your exceptional scientific expertise and guidance, you have become one of my dearest friends. Your brilliance, kindness, and dedication are truly inspiring, and I deeply cherish the many experiences we have shared, both professionally and personally.

I also wish to thank all co-authors involved in the drug combination prediction work, including Sandor Szedmak, Prson Gautam, Jane Douat, Tapio Pahikkala, and Tero Aittokallio. Sandor, your boundless ideas and deep mathematical knowledge continue to amaze me, and I have greatly enjoyed our many discussions. Prson, thank you for taking the time to validate my predictions in the lab. Jane, thank you for your enthusiasm and commitment to your work. Tapio and Tero, I am deeply grateful for your fantastic ideas, expertise and guidance, which were integral to the success of this work. I am truly grateful for the unique contributions each of you brought to this work.

I would also like to thank all co-authors from the metabolomics research conducted at Nightingale Health, especially Kirsten Schut, Sini Kerminen, Valtteri Mäkelä, Kristian Nybo, Jussi Nokso-Koivisto, Sara Lundgren, Nurlan Kerimov, Luke Jostins-Dean, Mika Tiainen, Harri Koskela, Eline Slagboom, Antti Kangas, Pasi Soininen, Peter Würtz, and Jeffrey Barrett. It has been a pleasure to work with all of you. Sini and Kirsten, you have both been wonderful colleagues and friends, and I admire your enthusiasm and dedication to your work. I also wish to extend my thanks to all other colleagues at Nightingale Health, who made many aspects of this work possible.
A special thanks to Tuija, Salla, Valtteri, Kristian, Jussi, Sara, Nurlan, Joni, Vilma, Emmi, Ella, Juuso, and many others for making my time there so memorable. I also wish to thank the founders of Nightingale Health (Antti, Pasi, Peter and Teemu) for creating an exceptional environment for innovation that enabled me to contribute to impactful and world-leading research projects.

I thank my pre-examiners, Ron Do and Taru Tukiainen, for their thorough evaluation of this work and for their insightful and encouraging feedback. I am also deeply grateful to Maik Pietzner for accepting the role of opponent and for dedicating the time to engage with my research and this process; it is an honour to have you as the opponent.

I thank those who facilitated the collection and processing of the datasets used in this research. Special thanks to the UK Biobank Laboratory and Data Access Teams for the seamless collaboration in creating the metabolomics datasets now accessible to researchers worldwide. I am also grateful to the UK Biobank study participants, whose contributions were essential to this work. I would also like to acknowledge Aalto University’s CS-IT team for providing the computational resources that supported many of the machine learning aspects in this work.

Along the way, I have had the privilege of working with many wonderful people. I am deeply grateful to all my past and present colleagues and collaborators for the enriching discussions and memorable moments we have shared. To my colleagues at the Aalto Computer Science department (Maryam, Riikka, Tian, Anchen, Gianmarco, Robert, Emily, Taneli, Vikas, and Elena), thank you for the engaging discussions, group lunches, and shared activities. A special thanks to Maryam and Riikka for your kindness and shared office moments that have brightened my days.
Finally, to all my friends (those from earlier days, those met during my university years, and those encountered at work), thank you for bringing joy and balance to my life. To all my family, thank you for your belief in me and for standing by me through every step. A special thanks to my sister Henna for your constant encouragement, and to Peter for your unwavering support and love. I am deeply grateful to each of you for being part of this journey.

Helsinki, December 22, 2024
Heli Julkunen

Contents

Acknowledgements
Contents
List of Publications
Author’s Contribution
1. Introduction
  1.1 Motivation
    1.1.1 Risk prediction in precision medicine
    1.1.2 Treatment strategies in precision medicine
  1.2 Research aims
  1.3 Summary of research contributions
  1.4 Outline
2. Background
  2.1 Molecular profiling in precision medicine
  2.2 Disease risk prediction in precision medicine
    2.2.1 Fundamentals of time-to-event modelling
    2.2.2 Risk prediction in preventive healthcare
    2.2.3 Emerging trends in risk prediction
  2.3 Metabolomics for predicting disease risks
    2.3.1 Human metabolome
    2.3.2 High-throughput profiling of metabolites
    2.3.3 Prior research in metabolomics and disease risk
  2.4 Treatment strategies in precision medicine
    2.4.1 Drugs and drug targets
    2.4.2 Drug combination treatments
    2.4.3 Quantifying drug combination effects
    2.4.4 Prior research in predicting drug combination effects
  2.5 Machine learning and statistical inference
    2.5.1 Multiple regression
      Logistic regression
      Cox proportional hazards regression
      Statistical inference and interpretation
      Regularization
      Interactions
    2.5.2 Factorization machines
      Standard formulation of factorization machines
      Higher-order factorization machines
3. Predictive modelling of drug combination effects (Publication I)
  3.1 Foundations of comboFM
  3.2 Drug combination dataset
  3.3 Evaluation settings
  3.4 Accurate predictions of drug combination effects
  3.5 Experimental validation of predicted drug combinations
  3.6 Conclusions
4. Predictive modelling of disease risks using metabolomic biomarkers (Publications II-IV)
  4.1 Datasets
    4.1.1 UK Biobank
    4.1.2 THL Biobank
    4.1.3 Estonian Biobank
    4.1.4 NMR metabolomic biomarker profiling
  4.2 Predictive modelling of severe infectious disease and COVID-19 risk using metabolomic biomarkers (Publication II)
    4.2.1 Study setting and methodology
    4.2.2 Associations of individual metabolomic biomarkers with severe pneumonia and COVID-19
    4.2.3 Multi-biomarker score stratifies the risk of severe infectious diseases
    4.2.4 Conclusions
  4.3 Systematic characterization of the associations of metabolomic biomarkers across common diseases (Publication III)
    4.3.1 Study setting and methodology
    4.3.2 Biomarker associations across a broad range of diseases
    4.3.3 Insights into shared biomarker signatures
    4.3.4 Accounting for the effects of lipid lowering medications
    4.3.5 Conclusions
  4.4 Metabolomic and genomic prediction of common diseases (Publication IV)
    4.4.1 Study setting and methodology
    4.4.2 Metabolomic risk scores stratify disease risk
    4.4.3 Conclusions
5. Machine learning with comprehensive interaction modelling for disease risk prediction (Publication V)
  5.1 Foundations of survivalFM
  5.2 Study population and evaluation settings
  5.3 Improved prediction of disease risk across various settings
  5.4 Enhanced cardiovascular risk prediction performance
  5.5 Conclusions
6. Concluding remarks
References
Publications

List of Publications

This thesis consists of an overview and of the following publications, which are referred to in the text by their Roman numerals.

I. Heli Julkunen, Anna Cichońska, Prson Gautam, Sandor Szedmak, Jane Douat, Tapio Pahikkala, Tero Aittokallio, Juho Rousu. Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nature Communications, December 2020.
II. Heli Julkunen, Anna Cichońska, P. Eline Slagboom, Peter Würtz, Nightingale Health UK Biobank Initiative. Metabolic biomarker profiling for identification of susceptibility to severe pneumonia and COVID-19 in the general population. eLife, May 2021.

III. Heli Julkunen, Anna Cichońska, Mika Tiainen, Harri Koskela, Kristian Nybo, Valtteri Mäkelä, Jussi Nokso-Koivisto, Kati Kristiansson, Markus Perola, Veikko Salomaa, Pekka Jousilahti, Annamari Lundqvist, Antti J. Kangas, Pasi Soininen, Jeffrey C. Barrett, Peter Würtz. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nature Communications, February 2023.

IV. Nightingale Health Biobank Collaborative Group. Metabolomic and genomic prediction of common diseases in 700,217 participants in three national biobanks. Nature Communications, November 2024. List of authors in alphabetical order: Jeffrey C. Barrett, Tõnu Esko, Krista Fischer, Luke Jostins-Dean, Pekka Jousilahti, Heli Julkunen, Tuija Jääskeläinen, Antti Kangas, Nurlan Kerimov, Sini Kerminen, Anastassia Kolde, Harri Koskela, Jaanika Kronberg, Sara N. Lundgren, Annamari Lundqvist, Valtteri Mäkelä, Kristian Nybo, Markus Perola, Veikko Salomaa, Kirsten Schut, Maiju Soikkeli, Pasi Soininen, Mika Tiainen, Taavi Tillmann, Peter Würtz.

V. Heli Julkunen, Juho Rousu. Machine learning for comprehensive interaction modelling improves disease risk prediction in the UK Biobank. Submitted, July 2024. Available on medRxiv.

Author’s Contribution

Publication I: “Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects”

The research project was initiated and conceptualized jointly by me, Anna Cichońska, Sandor Szedmak, Tapio Pahikkala, Tero Aittokallio, and Juho Rousu. I had a primary role in designing and implementing the comboFM machine learning framework and performing the computational analyses, with input from Anna Cichońska.
The design of the computational analyses and evaluation protocols was shaped through collaborative input from Anna Cichońska, Sandor Szedmak, Tapio Pahikkala, and Juho Rousu. The random forest comparison experiment was performed by Jane Douat under my supervision. I prepared all the figures. Based on the computational predictions, experimental wet-lab validation of drug combinations was designed and performed by Prson Gautam. The results were analyzed jointly by the authors. The initial draft of the article was written by me, and then revised and edited by Anna Cichońska, Tero Aittokallio and Juho Rousu, with contributions from Sandor Szedmak and Tapio Pahikkala.

Publication II: “Metabolic biomarker profiling for identification of susceptibility to severe pneumonia and COVID-19 in the general population”

The research was conceptualized and designed jointly by all authors. The data from UK Biobank was curated and processed jointly by me and Anna Cichońska. The computational and statistical analyses were implemented and performed mainly by me. The results were jointly interpreted by all authors. I prepared all figures except Figure 9, which was prepared by Anna Cichońska. All authors contributed to writing the article.

Publication III: “Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank”

The research was designed and conceptualized jointly by me, Anna Cichońska, Antti Kangas, Pasi Soininen, Jeffrey Barrett and Peter Würtz. The data from UK Biobank was curated and processed jointly by me and Anna Cichońska. The computational and statistical analyses were implemented and performed mainly by me. The online tool to visualize the results and query the summary statistics was implemented by me. I prepared all the figures.
Mika Tiainen, Harri Koskela, Kristian Nybo, Valtteri Mäkelä, Jussi Nokso-Koivisto, Pasi Soininen and Antti Kangas performed the NMR metabolomic biomarker measurements, quantification and quality control. Kati Kristiansson, Markus Perola, Veikko Salomaa, Pekka Jousilahti and Annamari Lundqvist contributed to data collection at the THL Biobank, which was used for replication analyses. The results were jointly interpreted and the article written by me, Anna Cichońska, Jeffrey Barrett and Peter Würtz.

Publication IV: “Metabolomic and genomic prediction of common diseases in 700,217 participants in three national biobanks”

This project was carried out in collaboration with three major biobanks (UK Biobank, Estonian Biobank and THL Biobank) and included investigators from five institutions. The research was conceptualized primarily by me, Peter Würtz and Jeffrey Barrett. The data was curated and the statistical analyses were performed by me, Nurlan Kerimov, Sini Kerminen, Sara Lundgren, Kirsten Schut and Luke Jostins-Dean. My primary contribution was in designing the methodology for the study, implementing the computational frameworks and pipelines for model training and evaluation, and conducting cross-biobank analyses of the metabolomic biomarker scores. This included analyses of the model performance and calibration, along with the related visualizations. Analyses and visualizations related to polygenic risk scores, clinical risk factors and multiple time points were performed by Sini Kerminen, Sara Lundgren, Kirsten Schut and Luke Jostins-Dean. After the initial submission of the article, I transitioned to Aalto University, and the subsequent revisions have been carried out by the other co-authors. Harri Koskela, Valtteri Mäkelä, Kristian Nybo, Maiju Soikkeli and Pasi Soininen performed the NMR metabolomic biomarker measurements, quantification and quality control.
The manuscript was written by Jeffrey Barrett, with contributions from me, Luke Jostins-Dean, Peter Würtz, Sini Kerminen, Sara Lundgren, Nurlan Kerimov and Kirsten Schut. Detailed contributions from all authors are given in the original publication.

Publication V: “Machine learning for comprehensive interaction modelling improves disease risk prediction in the UK Biobank”

I conceived the idea of developing a machine learning method for survival analysis that accounts for comprehensive interaction effects among predictor variables, using concepts derived from the factorization machines applied in Publication I. The methodology was mainly designed by me, with input from Juho Rousu. The R package implementation of the method was written by me. The data from UK Biobank was curated and processed by me. The computational analyses were designed, implemented and performed by me. I analyzed the results and prepared all the figures. The manuscript was written mainly by me.

1. Introduction

1.1 Motivation

Precision medicine is increasingly regarded as the future of healthcare, where prevention and treatment strategies are implemented by accounting for the unique characteristics of individual patients or subgroups of patients. This concept has also gained attention from policymakers; for instance, in 2015, former U.S. President Barack Obama launched the Precision Medicine Initiative, aimed at "delivering the right treatment at the right time, every time, to the right person" [1,2]. Similar initiatives have emerged globally [3–6], reflecting the growing recognition that precision medicine has the potential to reduce healthcare costs, improve patient outcomes, and enable more effective, targeted interventions [7–9].

The concept of precision medicine is not new; for instance, blood types have been used to personalize blood transfusions for over a century [10].
However, the prospect of widely applying this concept has been notably improved by the advances in molecular profiling ’omics technologies, such as genomics, transcriptomics, proteomics and metabolomics, which have increased the amount of molecular data that can be collected for each individual patient. The improved scalability and reduced costs of these platforms have facilitated their widespread adoption in research settings and are beginning to contribute to their use in clinical practice [11,12]. These advancements, coupled with the development of computational methods for analyzing the vast amounts of generated data, have enhanced our understanding of the molecular alterations underlying disease development. Consequently, this has created opportunities for discovering effective treatments, identifying disease biomarkers, and developing risk prediction models [8,13–15].

The challenge lies in translating the continuously increasing volumes of complex data into actionable insights for precision medicine. Computational methods, such as machine learning, are crucial in this endeavor, as they enable the integration, analysis, and interpretation of vast, heterogeneous datasets. Machine learning builds upon statistical learning theory to learn patterns from observed data and to predict outcomes in previously unseen instances. In the context of precision medicine, machine learning methods can be applied, for instance, to predict disease risks and responses to treatments based on the unique molecular and clinical characteristics of the patients [16–19].

This dissertation develops machine learning frameworks and performs statistical analyses to contribute to various aspects of precision medicine, leveraging recently emerged extensive biomedical data collections.
The research focuses on two primary themes: (1) improving disease risk prediction, particularly through the use of metabolomic biomarkers and the development of machine learning methods for risk modelling, and (2) advancing the discovery of effective treatments by developing a machine learning framework to predict the effects of drug combination therapies. Hence, this dissertation contributes to advancing both prevention and treatment aspects of precision medicine.

1.1.1 Risk prediction in precision medicine

Identifying undiagnosed individuals at an elevated risk of developing a disease is essential for precision medicine to enable targeted interventions that prevent or delay disease onset. For instance, in current clinical practice, cardiovascular risk prediction models are widely used to guide the allocation of lipid-lowering treatments based on factors like cholesterol levels, blood pressure and age [20–22]. However, risk prediction can be improved and extended to a wider range of diseases using ’omics technologies. This is exemplified by polygenic risk scores, which aggregate data from numerous genetic variants to estimate an individual’s susceptibility to disease. The success of polygenic risk scores has been largely driven by the availability of genomic data at scale [23,24]. As other types of ’omics data become similarly accessible, they also present potential for risk prediction and may provide even greater prediction accuracy.

Metabolomics, the comprehensive profiling of metabolites, has also gained traction as a promising tool for molecular profiling of disease risk. Similar to many routine clinical risk factors for chronic diseases, the levels of metabolites reflect the broad downstream effects of genetic, environmental, and lifestyle factors [25], making them attractive candidates for risk prediction.
Some individual blood metabolites are already routinely used in clinical practice; for instance, glucose is a marker for type 2 diabetes, creatinine is used for evaluating kidney function and cholesterol levels are used to evaluate cardiovascular disease risk. However, detailed metabolomic profiling holds promise to further improve risk prediction. Research studies have widely established associations of metabolomic biomarkers with the risk of cardiometabolic diseases [26–28], and emerging evidence suggests they may play a broader role in overall human health and disease [29,30].

To fully establish the potential of metabolomics in risk prediction, it is essential to incorporate metabolomic profiling data into large prospective cohort studies, as they provide the extensive longitudinal data required to identify and validate reliable associations with disease risk. Achieving this requires mature metabolomics platforms capable of profiling large datasets with high consistency and reproducibility. Among the available technologies, nuclear magnetic resonance (NMR) spectroscopy has emerged as an appealing platform due to its ability to reproducibly quantify abundant circulating metabolites at relatively low cost and high throughput [28,31]. Given its scalability and reproducibility, NMR is particularly well-suited for large-scale studies, addressing key requirements for both epidemiological research and eventual clinical translation. Therefore, applying NMR-based metabolomic profiling in extensive cohort studies like the UK Biobank presents a unique opportunity to explore the broader applicability of metabolomics for risk prediction across a wide range of diseases.

In addition to informative data, the accuracy of risk prediction also depends on the computational methods used to derive the prediction model.
Since disease risks are often time-to-event outcomes, it is essential to employ appropriate modeling techniques that account for censored data and varying follow-up times. Established methods in epidemiological research, such as Cox proportional hazards regression [32], are widely used but have limitations in capturing non-linear relationships and interactions present in complex biological data. Advanced machine learning methods, such as random survival forests [33] and deep survival models [34,35], are better equipped to handle such complexities. However, they often sacrifice interpretability, which is frequently desired for translational applications [36–39]. Therefore, there is a need for methods that balance the ability to model complex relationships with the interpretability necessary for translational risk prediction applications.

1.1.2 Treatment strategies in precision medicine

When disease prevention is unattainable, precision medicine aims to optimize treatment by taking into account individual characteristics of the patient. Improved understanding of the molecular underpinnings of diseases has driven the discovery of new drug targets and enabled the development of targeted therapies, which selectively interfere with the molecular drivers of disease progression [40,41]. Targeted therapies have shown remarkable success particularly in cancer, where research into the molecular alterations underlying various cancer subtypes has uncovered key drivers of tumor growth [42–45]. By targeting these molecular drivers, such therapies can halt cancer cell proliferation while minimizing damage to healthy tissue, resulting in fewer side effects and providing a safer alternative to conventional cytotoxic chemotherapies. Consequently, targeted drugs have become a major therapeutic class in cancer treatment, improving response rates and prolonging survival in patients with corresponding molecular alterations [42].
Despite the transformative impact of targeted therapies on cancer treatment, they are not universally effective. A notable challenge to the success of targeted therapies is the development of drug resistance, which often arises as cancer cells adapt to the selective pressure imposed by single-target therapies [46–48]. As a result, relying on the modulation of only a single target often proves insufficient. To overcome this limitation, combination drug therapies have emerged as a promising strategy and are increasingly used in the treatment of many cancers and other complex diseases [48–50]. In addition to overcoming single-drug resistance, drug combination treatments can also improve therapeutic efficacy by acting through different molecular targets or mechanisms, and can potentially decrease treatment-related toxicity by lowering the doses of the drugs needed to generate the same response [42,47,48,51].

The importance of combination therapies in cancer treatment is evident, with regulatory bodies such as the U.S. Food and Drug Administration (FDA) approving 81 new drug combinations for oncology between 2011 and 2021 [52]. However, identifying effective drug combinations presents a notable challenge, as the number of potential drug combinations far exceeds what can be feasibly tested in clinical settings. Although extensive high-throughput drug combination screens have been conducted for various cancer types [53–55], computational approaches, such as machine learning, will be essential in narrowing down the extensive combinatorial possibilities [56,57]. However, accurately modeling the effects of drug combinations is challenging due to the complexity of responses across varying doses and the molecular heterogeneity of cancer. This creates a need for methods that can accurately predict drug combination effects across different cancer subtypes and dosing regimens.
1.2 Research aims

The research presented in this dissertation is organized into three main chapters, each corresponding to specific research aims.

Predictive modelling of drug combination effects

Despite the promise of machine learning in identifying effective drug combinations, the majority of existing methods overlook the fact that the effects of drug combinations can vary depending on their doses. Most existing approaches simplify the problem to either a binary classification or a continuous regression problem of synergy, aiming to determine whether the combined summary effect is greater than what would be expected from the individual drugs [58,59]. Since a drug combination that is synergistic at one dose could be antagonistic or additive at another, there is a need for methods that can predict combination effects at specific doses. This is also essential for translating the predictions into clinical practice, as lower doses are often better tolerated by the patients. Addressing this challenge was the motivation behind the first research aim:

• Research aim 1: Develop a machine learning framework capable of predicting drug combination responses at the level of individual dose-responses.

Predictive modelling of disease risks using metabolomic biomarkers

Given the potential of metabolomics to inform on disease risks, this dissertation builds upon a novel NMR metabolomics dataset from the UK Biobank, containing data for over 100,000 individuals. This metabolomics dataset is the largest of its kind to date and provides a unique opportunity to assess the relevance of metabolomic biomarkers at scale and across a diverse array of diseases. The first study using this dataset was prompted by the global health concern posed by the coronavirus disease 2019 (COVID-19) pandemic. While previous research had established older age and chronic health conditions as risk factors for severe infections [60,61], the role of molecular factors like metabolomic and inflammatory biomarkers in susceptibility remained unclear. This led to the formulation of the second research aim:

• Research aim 2: Evaluate the potential of metabolomic biomarkers to inform on the susceptibility to severe infectious diseases and COVID-19.

To further understand the broader impact of the metabolomic biomarkers on health and disease, we sought to systematically analyze the associations of individual metabolomic biomarkers across all common diseases, extending also beyond those of cardiometabolic origin. This led to the formulation of the third research aim:

• Research aim 3: Systematically characterize associations between individual metabolites and disease risk across a broad range of common diseases.

The strong associations observed across various diseases motivated us to assess the ability of risk prediction models derived using the metabolomic biomarkers to predict leading chronic diseases with significant public health relevance. This further raised important questions about the transferability of such risk prediction models across different population cohorts. These considerations led to the formulation of the subsequent research aim:

• Research aim 4: Evaluate the potential of metabolomic risk scores in predicting the risk of leading chronic diseases and investigate their transferability across different population cohorts.

Machine learning with comprehensive interaction modelling for disease risk prediction

One challenge in building accurate disease risk prediction models is accounting for the complex relationships that can influence outcomes in non-linear ways, such as interactions among predictor variables.
Inspired by the results in Publications II-IV and the method proven effective in Publication I, we sought to explore whether the technique used to model interactions in drug combination data under the first research aim could be adapted to modelling interactions among predictor variables for time-to-event outcomes, such as disease risks. This culminated in the formulation of the final research aim:

• Research aim 5: Develop a machine learning method to account for comprehensive interaction effects among predictor variables in modelling time-to-event outcomes.

1.3 Summary of research contributions

The main contributions of this dissertation in terms of key results and impact are given in the original Publications I-V and summarised below.

• In Publication I, we developed comboFM, a machine learning framework to predict dose-specific drug combination responses. While earlier methods had primarily addressed direct prediction of drug combination synergy [58,59], comboFM enables detailed dose-specific response predictions with the ability to subsequently evaluate synergies at different dose combinations. comboFM learns from experimental drug combination response data by using higher-order factorization machines [62], a recently proposed machine learning technique that can account for comprehensive interactions in large datasets. Our computational experiments using data from a cancer cell line pharmacogenomic screen demonstrated that comboFM obtains high prediction accuracy in various practical prediction scenarios. The practical utility of comboFM was further confirmed through experimental validation of previously untested drug combinations, demonstrating its potential to advance the development of combination therapies.

• In Publication II, we explored whether metabolomic biomarkers measured by NMR could predict susceptibility to severe pneumonia and COVID-19.
Our findings revealed new molecular insights into the susceptibility to severe infections. In addition to inflammatory biomarkers, many blood biomarkers previously associated mainly with cardiometabolic diseases, such as amino acids and fatty acids, were also predictive of hospitalization for severe infections years later. We derived a metabolomic risk score, a risk prediction model based on metabolomic biomarkers, and showed its strong association with increased risk of both severe pneumonia and severe COVID-19, even up to a decade after the initial blood samples were collected. These findings highlight the potential of NMR metabolomic profiling to complement existing risk identification tools and improve understanding of the molecular mechanisms underlying susceptibility to severe infections.

• In Publication III, we systematically characterized the associations of NMR metabolomic biomarkers across the incidence, prevalence and mortality of hundreds of common diseases in the UK Biobank. Our findings revealed a broad range of novel biomarker associations, including risks of various cancers, mental health outcomes, and musculoskeletal disorders. While earlier research had primarily linked these biomarkers to cardiometabolic conditions [26–28], this study demonstrates their broader relevance across a wide range of diseases and in a uniquely large population-scale setting. Additionally, we identified both similarities and differences in biomarker association patterns across various diseases, suggesting underlying systemic connections as well as distinct molecular signatures unique to certain conditions. Furthermore, the associations were shown to widely replicate in THL Biobank. Our results highlight the potential of these biomarkers to provide valuable insights into multi-disease risk.

• In Publication IV, motivated by the results from Publications II-III, we derived metabolomic risk scores to predict the risk of 12 leading causes of morbidity.
This study leveraged a uniquely large population sample from three major national biobanks, UK Biobank, Estonian Biobank, and Finnish THL Biobank, totaling over 700,000 participants. This unique setup enabled the validation of metabolomic risk scores across multiple population cohorts, demonstrating consistent cross-biobank replication. This study was also the first to systematically compare the predictive performance of metabolomic risk scores and polygenic risk scores across multiple diseases within such a large population. We showed that metabolomic risk scores exhibited stronger associations with future disease risk than polygenic scores for the majority of diseases studied. These results demonstrate the potential of metabolomic biomarkers to inform disease risk prediction.

• In Publication V, we introduced survivalFM, a novel machine learning method for time-to-event risk prediction. This method effectively accounts for comprehensive interactions among predictor variables. It utilizes a low-rank factorized parametrization approach, a concept adapted from the factorization machines applied to drug combination prediction in Publication I. This approach facilitates simultaneous estimation of comprehensive interaction effects, even with numerous predictor variables, for predicting time-to-event outcomes, such as disease risks. In contrast to many other advanced machine learning approaches in survival analysis [33–35], survivalFM produces interpretable models, which is often important for translational applications. Using data from the UK Biobank, we demonstrated that survivalFM improves risk prediction performance across various data sources and disease outcomes, including a practical cardiovascular risk prediction scenario. These findings highlight the potential of survivalFM to provide more nuanced insights into disease risk and to enhance the accuracy of risk predictions.

1.4 Outline

This dissertation is structured as follows.
Chapter 2 provides the biological and computational background related to the various aspects of precision medicine addressed in this dissertation. It establishes the current state of the field and justifies the research aims outlined in the introduction. Chapters 3, 4, and 5 present the research contributions from Publications I-V, organized according to the main themes: i) predictive modelling of drug combination effects, ii) predictive modelling of disease risks using metabolomic biomarkers and iii) extending risk prediction methodology by accounting for comprehensive interaction effects. Finally, Chapter 6 provides conclusions and discusses potential future research directions.

2. Background

2.1 Molecular profiling in precision medicine

Broad molecular profiling, commonly referred to as ’omics, comprises a set of high-throughput experimental technologies and related scientific fields dedicated to the large-scale, quantitative analysis of molecular data, particularly as they relate to human health and disease. The suffix -ome was initially used in the context of genome studies, referring to the complete set of genetic material within an organism. Over time, -ome has come to denote the totality of various biological entities, and correspondingly, ’omics has become a general term for the study of comprehensive biological datasets. The concept of the ’omics cascade illustrates the hierarchical flow of biological information, from DNA (genome) to RNA (transcriptome), proteins (proteome), and metabolites (metabolome), each layer offering different insights into biological processes (Figure 2.1).

’Omics technologies have provided precision medicine with the tools to operate at a detailed molecular level. These technologies, especially in the field of genomics, have already driven translational research and are gradually beginning to contribute to clinical practice.
For instance, genome-wide association studies (GWAS) have provided insights into the genetic basis of various complex diseases [63,64]. These findings have not only enriched our understanding of the genetic architecture of disease but also proven valuable for therapeutic development, as targets backed by genetic evidence are more likely to succeed in clinical trials [65,66]. Furthermore, GWAS findings are now being utilized in other applications, such as genetic risk prediction using polygenic risk scores, which may soon be incorporated as stratification tools in clinical settings [23,67–69].

As ’omics technologies become more affordable and scalable, including those that measure biological entities closer to the phenotype (e.g., the metabolome and proteome), they enable a more comprehensive view of the molecular effects related to disease outcomes. Such developments will advance precision medicine by enabling the identification of previously undetected biomarkers, refining disease subtyping, improving the accuracy of risk prediction models, and informing the development of more effective therapeutic strategies [8,13–15].

[Figure 2.1: flow diagram. Genomics (DNA) → transcription → Transcriptomics (RNA) → translation → Proteomics (proteins) → protein activity → Metabolomics (metabolites) → phenotype (disease).]

Figure 2.1. ’Omics cascade illustrating the flow of biological information from genomics through transcriptomics and proteomics to metabolomics.

2.2 Disease risk prediction in precision medicine

Disease risk prediction refers to estimating the likelihood that an individual will develop a particular disease or condition within a specified time frame, based on the analysis of one or more risk factors. These factors can include, for instance, environmental, behavioral, and clinical attributes, or molecular biomarkers measured through traditional laboratory chemistry or emerging ’omics technologies.
Prediction can be based on a single risk factor, such as a biomarker that directly correlates with disease risk, or it can involve complex statistical or machine learning models that integrate multiple risk factors to provide an estimate of disease risk. The ultimate goal of disease risk prediction is to enable precision medicine by stratifying individuals into different levels of risk. By identifying individuals at high risk early, targeted prevention strategies and interventions can be implemented, potentially altering the disease trajectory and improving health outcomes.

In this dissertation, the main focus is placed on predicting the first onset of disease in currently undiagnosed individuals. However, risk prediction can also be applied in the context of predicting disease progression or recurrence.

2.2.1 Fundamentals of time-to-event modelling

Time-to-event analysis, also known as survival analysis, is a branch of statistics and machine learning that deals with estimating the time until the occurrence of an event, such as disease onset, based on given characteristics of an individual.

Survival analysis involves time-to-event data, which measure the duration from the start of a follow-up period to the occurrence of a specified event, such as the onset of a disease. A central feature of such data is censoring, which arises when the precise timing of the event remains unknown. This can occur if the study ends before the event occurs, or if an individual is lost to follow-up. The most frequently encountered type of censoring is right-censoring, which occurs when the event has not been observed by the end of an individual's follow-up. Other forms include left-censoring, where the event takes place before the individual is enrolled in the study, and interval-censoring, where the event is known to have occurred within a specified time interval, though the precise time remains unknown.
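The effect of right-censoring on what is actually recorded can be illustrated with a small sketch. It assumes that each individual has a (possibly never observed) true event time and a censoring time at which observation stops, and it returns the observed time together with an event indicator (1 for an observed event, 0 for a censored observation). The function name `observe` and the example values are hypothetical and chosen only for illustration.

```python
import math


def observe(event_time, censoring_time):
    """Return (observed_time, event_indicator) for one individual.

    event_time: true time of the event (math.inf if it never occurs).
    censoring_time: time at which observation stops, e.g. the end of
        follow-up or the time the individual is lost to follow-up.
    """
    if event_time <= censoring_time:
        # The event happens while the individual is still under observation.
        return event_time, 1
    # Otherwise we only know that the event had not occurred by censoring_time.
    return censoring_time, 0


# Hypothetical cohort of four individuals (times in years since baseline).
true_event_times = [3.2, math.inf, 12.5, 7.0]
censoring_times = [10.0, 10.0, 10.0, 5.5]  # the last one is lost to follow-up early

observed = [observe(e, c) for e, c in zip(true_event_times, censoring_times)]
```

Only the first individual contributes an observed event; for the other three, the recorded time is a lower bound on the true event time, which is exactly the information that survival models are designed to use.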
Time-to-event data and censoring

This dissertation focuses on the analysis of right-censored data (Figure 2.2). In such cases, the outcome is represented by two variables: an indicator of whether the event of interest, such as disease onset, was observed, and the time from the start of the follow-up period until either the event occurs or the individual is censored. Formally, the time-to-event dataset $D$ is defined as a collection of tuples $D = \{(x_i, T_i, \delta_i)\}_{i=1}^{N}$, where $x_i$ denotes a vector of predictors characterizing individual $i$, $T_i$ represents the observed time until the occurrence of the event or censoring, and $\delta_i$ is an indicator variable that takes the value of 1 if $T_i$ corresponds to an observed event and 0 if it is a censored observation.

Survival and hazard functions

In survival analysis, two central functions are used to model time-to-event data: the survival function and the hazard function. The survival function, denoted as $S(t)$, gives the probability that an individual will survive beyond a specified time point $t$. It relates to the cumulative distribution function (CDF) of the event times, $F(t)$, which quantifies the cumulative risk of the event occurring by time $t$:

$$S(t) = P(T > t) = 1 - F(t), \tag{2.1}$$

where $0 \le S(t) \le 1$ and $F(t) = \int_0^t p(z)\,dz$, with $p(t)$ representing the probability density function (PDF) that characterizes the likelihood of the event occurring at a specific time $t$.

[Figure 2.2: timeline diagram. Individuals $1, \dots, N$ are followed from baseline ($t = 0$) until an event ($\delta = 1$) or a censored observation ($\delta = 0$) at times $T_1, T_2, \dots, T_N$ within the follow-up period; predictors $x_i$ enter a risk model $g(x_i)$, which yields the estimated risk $h(t|x_i)$, $S(t|x_i)$.]

Figure 2.2. Right-censored survival data in time-to-event modelling. Individuals are observed over a follow-up period, during which the occurrence of the event of interest is recorded. Right-censoring arises when the event has not happened by the end of the follow-up period or if individuals are lost to follow-up.
Predictor variables, $x_i$, for an individual $i$ are measured at baseline and are used to estimate the individual's risk using a risk model $g(x_i)$. This risk estimate can be expressed through functions such as the hazard function $h(t|x_i)$ or the survival function $S(t|x_i)$, which quantify the likelihood of the event occurring over time.

The hazard function, $h(t)$, provides a complementary view by describing the instantaneous risk of the event occurring at time $t$, conditional on survival up to that point:

$$h(t) = \frac{p(t)}{S(t)} = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \tag{2.2}$$

Together, the survival and hazard functions provide complementary insights into the timing and likelihood of events. While the survival function is non-increasing over time, as the probability of survival diminishes, the hazard function may vary depending on how the risk evolves over time.

Data sources and predictor variables

Time-to-event risk prediction models are typically developed using data from epidemiological cohort studies, which collect baseline measurements from participants at the beginning of the study ($t = 0$), capturing a comprehensive range of predictor variables $x$ (Figure 2.2). These variables typically include sociodemographic factors, lifestyle behaviors, and family medical history, obtained through questionnaires, along with physiological and clinical measurements, such as blood pressure and body mass index. Additionally, biological samples, such as blood, are often collected and stored for subsequent laboratory analyses, including various ’omics measurements. Participants are then followed longitudinally, with both the occurrence and timing of relevant events recorded, generating time-to-event data for risk modeling. In modern cohort studies, follow-up data are often linked from electronic health records, providing continuous and detailed tracking of health outcomes over time.
The duration of follow-up for disease risk modeling typically ranges from 5 to 10 years, though this can vary depending on the objectives and design of the study. Disease events captured in cohort studies can be categorized into prevalent events and incident events. Prevalent events refer to occurrences that have taken place before the participant enrolled in the study, whereas incident events are those that arise during the follow-up period. Risk prediction models are primarily concerned with incident events, as they represent new cases of disease that develop after the baseline assessment. Prevalent events are typically excluded from risk prediction analyses, as individuals with existing disease have already experienced the outcome, complicating baseline risk assessment and confounding the analysis.

Methods in time-to-event risk modelling

The objective of time-to-event risk prediction is to accurately predict survival, cumulative risk, or hazard functions for a given individual at any time of interest $t$, based on a set of predictor variables (Figure 2.2). Typically, such models are expressed in terms of the hazard function $h(t|x)$. However, since the survival, cumulative risk, probability density and hazard functions are mathematically related, knowing one allows derivation of the others; for example, $p(t) = h(t)\,S(t)$ and $S(t) = \exp\left(-\int_0^t h(z)\,dz\right)$.

One of the most widely used methods for modelling time-to-event data is the Cox proportional hazards regression model [32], which will be covered in detail in Section 2.5.1. Despite its widespread use, Cox proportional hazards regression also has limitations, particularly its assumption of linear relationships between predictor variables and the log-hazard function. To address these limitations, various alternative survival analysis methods have been developed. Advanced machine learning techniques, such as random survival forests [33] and deep survival models [34,35], are capable of capturing complex interactions and non-linear relationships within the data. While these methods can potentially improve predictive performance, they also come with trade-offs, such as increased computational complexity and the need for extensive hyperparameter tuning. More importantly, these models sacrifice interpretability, which is often needed in clinical decision making where understanding the risk factors underlying the prediction is essential [36–39]. Furthermore, some studies suggest that the performance improvements offered by these advanced methods are modest and require large sample sizes to be effective [70,71]. Therefore, there is a need for methods that balance the ability to model complex relationships with the interpretability needed for translational applications.

The predictive performance of risk prediction models generally depends on the outcome being modeled, the population to which the model is applied, the amount of information the predictors provide about the outcome, and the suitability of the modelling methodology. Once a prediction model is developed, its performance must be evaluated from several perspectives, most commonly discrimination, calibration, and reclassification. Discrimination measures how well the model distinguishes between individuals who will experience the outcome and those who will not, typically assessed by the area under the ROC curve (AUC) or the concordance index (C-index) [72]. Calibration evaluates how closely the predicted probabilities align with the actual outcomes. A well-calibrated model should have predictions that match the observed risk of events. Reclassification assesses whether a new model improves risk classification. For instance, the net reclassification index (NRI) quantifies how well individuals are reclassified into more appropriate risk categories, based on whether they eventually experience the event or not [73].
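As an illustration of how discrimination can be quantified for right-censored data, the following minimal sketch computes a pairwise concordance index, assuming the common definition usually referred to as Harrell's C. The function name `concordance_index` and the toy data are hypothetical, and pairs with tied observed times are simply skipped, which is a simplification rather than a complete treatment.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Pairwise concordance index for right-censored data.

    times:  observed times (event or censoring time) per individual.
    events: 1 if the observed time is an event, 0 if it is censored.
    risks:  predicted risk scores (higher means the event is expected sooner).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter observed time.
        a, b = (i, j) if times[i] <= times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event;
        # pairs with tied times are skipped in this simplified sketch.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # earlier event received the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: the third individual is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
c_good = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
c_mixed = concordance_index(times, events, [0.1, 0.9, 0.5, 0.3])
```

In this toy example, the first set of risk scores orders every comparable pair correctly, whereas the second orders only two of the five comparable pairs correctly. A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of comparable pairs.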
Evaluation of risk prediction models

External validation is crucial for ensuring that the model generalizes to different populations or settings. This involves applying the model to new datasets to confirm that its performance remains consistent. Guidelines like TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) [74] provide best practices for the development, validation, and transparent reporting of prediction models.

2.2.2 Risk prediction in preventive healthcare

Advances in disease risk prediction have been shaped by large, longitudinal studies that have identified key risk factors and developed disease risk prediction models. For instance, the Framingham Heart Study, initiated in 1948, was instrumental in establishing several risk factors for cardiovascular disease (CVD) in the general population, such as smoking, obesity, high cholesterol, and hypertension [75–77]. These findings, along with contributions from other seminal studies [78,79], provided the basis for stratifying CVD prevention and treatment strategies by emphasizing the role of individual risk factors. The Framingham Risk Score, developed as a result of this work, remains widely used to estimate an individual’s CVD risk and to guide interventions such as lipid-lowering therapies [77,80,81]. Similar models have since been developed for other chronic diseases, such as the Gail Model for breast cancer risk prediction [82], the QDiabetes risk score for predicting type 2 diabetes risk [83], and the QCancer tool for assessing the risks of common cancers [84].

Cardiovascular medicine has been at the forefront of integrating risk factor analysis and predictive models into routine clinical practice. This is exemplified by the widespread development of CVD risk prediction models globally.
In addition to the Framingham Risk Score, notable examples include the SCORE model in Europe [22], the QRISK model in the UK [20], FINRISK in Finland [85], the ACC/AHA-ASCVD pooled cohort equations in the US [86], and the China-PAR model [87]. These CVD risk prediction models have become an integral part of local clinical guidelines, facilitating personalized prevention and treatment strategies, such as lifestyle changes and lipid-lowering treatments [88,89]. However, while risk prediction has been successfully integrated into preventive cardiovascular medicine and serves as an example of precision medicine, this level of application has largely not extended to other disease areas.

As the global burden of avoidable chronic health conditions is rising, the need for effective prevention strategies has become increasingly important [90,91]. In response, healthcare systems have introduced preventive health programs. For instance, the UK National Health Service (NHS) offers a Health Check every five years to identify individuals at high risk for conditions such as heart disease, kidney disease, type 2 diabetes, and stroke, using a set of standard clinical risk factors [92,93]. However, there is growing recognition that integrating ’omics data could enhance the accuracy of risk prediction and broaden its application to a wider range of diseases. Due to the increased scalability and reduced costs brought by technological advancements, these ’omics platforms are becoming more practical and timely for wider use in healthcare systems [11,12].

2.2.3 Emerging trends in risk prediction

The increasing availability of comprehensive datasets is driving advancements in disease risk prediction. Large-scale initiatives like the UK Biobank [94], FinnGen [95], All of Us [96], and Our Future Health [4] are providing vast resources that integrate electronic health records with genetic, clinical, lifestyle, and multi-omics data from hundreds of thousands of participants.
These datasets are valuable not only for investigating disease risk factors at the population level but also for the development and validation of prediction models with translational potential. For instance, recent studies have highlighted the value of utilizing electronic health records from such biobanks to predict disease risk, offering another dimension of health system-derived data for improving risk prediction [97–99].

Among the emerging approaches in risk prediction, polygenic risk scores (PRS) have gained considerable attention. A PRS is a numerical value that quantifies an individual’s genetic susceptibility to a specific trait or disease, typically computed as a weighted sum of risk alleles across multiple genetic variants, with weights derived from genome-wide association studies (GWAS). In recent years, PRSs have been developed and tested for a range of diseases, including coronary artery disease [23,100], stroke [101], chronic kidney disease [102], type 2 diabetes [23,103] and common cancers [104–107]. While PRSs are able to stratify population risk, especially for individuals in the highest risk percentiles, their predictive performance varies across diseases and populations. PRSs tend to perform better in populations of European descent, where most GWAS data have been generated, raising concerns about generalizability and equity in more diverse populations [69,108].

While polygenic risk scores have gained notable attention, other ’omics data modalities also hold potential for disease risk prediction. For instance, recent large-scale studies using plasma proteomics data from the UK Biobank have demonstrated that proteomic risk scores can effectively stratify risk for various common diseases and potentially improve the prediction accuracy compared to traditional risk factors [109–111]. Another promising area for risk prediction is metabolomics, which will be explored for risk prediction in Publications II-IV.
The following section will delve into metabolomics, providing the necessary background for understanding its role in disease risk prediction.

2.3 Metabolomics for predicting disease risks

2.3.1 Human metabolome

Analogous to the genome, transcriptome and proteome, the metabolome refers to the complete set of metabolites found in an organism. Metabolites reflect the downstream effects of genetic, transcriptomic, proteomic, and environmental factors, providing a detailed snapshot of the organism’s biochemical state close to the phenotype [112]. Current analytical platforms can detect hundreds to thousands of metabolites depending on the technology used; however, no single platform can capture the entire metabolome.

Metabolites encompass a wide range of molecules, serving as reactants, intermediates, or end products of enzyme-mediated biochemical reactions. The sources of metabolites can be divided into endogenous and exogenous origins. Endogenous metabolites emerge from internal biochemical reactions, while exogenous metabolites are introduced into the system from external sources such as dietary intake or pharmacological treatments. Consequently, metabolite levels are dynamic and reflect variations influenced not only by intrinsic factors such as genetics and disease states, but also by extrinsic factors such as diet and medications [113–115].

In terms of their chemical composition, metabolites exhibit stark differences when compared to genes, transcripts, and proteins. Unlike the chemical homogeneity observed in nucleotides forming genes and transcripts, or amino acids constituting proteins, metabolites span an extensive range of chemical classes. These can range from small hydrophilic sugars to large hydrophobic lipids. Additionally, the concentration ranges of metabolites can vary widely, with abundant metabolites being measurable in the millimolar range, while less abundant ones are only detectable in the picomolar range [116,117].
Various sample types, including blood, urine, cerebrospinal fluid and saliva, can be used to study the human metabolome [118]. However, this dissertation specifically focuses on blood metabolomics. Blood serves as the primary transport medium for nutrients, metabolites, hormones, and waste products throughout the body, offering a comprehensive reflection of the metabolic activities occurring in different organs and tissues. This holistic representation makes blood an ideal sample for metabolomics profiling, particularly in the study of systemic diseases. Moreover, blood is easily accessible through minimally invasive sampling techniques, making it a practical and convenient option for both clinical and research settings.

2.3.2 High-throughput profiling of metabolites

Metabolomics requires technologies capable of measuring a wide range of metabolites simultaneously. The two most commonly used methods for this purpose are nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS). Both techniques generate spectra, in which peak positions and intensities correspond to specific metabolites. However, NMR and MS operate on different principles, with each offering distinct advantages and limitations, making them complementary tools in metabolomics research [119].

NMR detects and quantifies metabolites based on the magnetic properties of protons. It provides high reproducibility, requires minimal sample preparation, and preserves sample integrity. NMR is inherently quantitative, allowing metabolite concentrations to be measured directly from signal intensity without external calibration. However, its sensitivity is low, making it more suitable for detecting metabolites present in medium to high concentrations [120]. In contrast, MS identifies metabolites based on their mass-to-charge ratio after ionization. It offers much higher sensitivity and can also detect low-abundance metabolites.
However, MS has difficulty distinguishing between metabolites with the same mass, and incurs higher costs due to the need for reference substances. Moreover, MS often faces issues with instrument consistency and signal drift, leading to lower repeatability [121,122]. Such variability can pose challenges in applications like disease risk prediction, where detecting subtle differences in metabolite levels among healthy individuals is important for accurately distinguishing risk profiles. Therefore, while MS can measure the broadest range of metabolites, NMR is appealing in terms of reproducible quantification of abundant circulating metabolites at scale and at a relatively low cost.

Publications II-IV of this dissertation utilize quantitative metabolomics data generated by the Nightingale Health NMR platform [28,31]. This platform simultaneously quantifies 168 metabolomic biomarkers and 81 ratios of these, including lipoprotein lipids, fatty acids, ketone bodies, amino acids, and other low-molecular-weight metabolic biomarkers, as well as lipoprotein subclass composition.

2.3.3 Prior research in metabolomics and disease risk

Advances in metabolomic profiling technologies over the past two decades have led to a substantial increase in metabolomics research, expanding our understanding of the metabolomic complexity underlying health and disease. Due to the vast number of studies in this field, this section focuses on examples of human studies and findings related to blood metabolomics and incident disease risk. Metabolomics research also encompasses a variety of other study types, including investigations into dietary impacts on metabolism [114,123], elucidation of the genetic determinants of metabolite levels [124–127], and assessments of causal relationships through Mendelian randomization studies [128–130]. Such studies are, however, beyond the scope of this motivating review.
Initial applications of metabolomic profiling were seen in small cohorts, typically involving a few hundred to a few thousand participants, often in case-control settings, and utilized both MS and NMR-based assays. Despite limited sample sizes, these first prospective cohort studies established key associations, such as linking metabolites like branched-chain and aromatic amino acids to the risk of type 2 diabetes 131–133 . Similarly, certain fatty acids and lipid biomarkers were associated with cardiovascular events, although findings were less consistent 134–137 . These initial investigations revealed the role of metabolic alterations in cardiometabolic diseases and highlighted the need for larger studies to enhance generalizability and statistical power.

Since 2015, advances in metabolomic profiling platforms, particularly those based on NMR, have enabled metabolomic profiling in large epidemiological cohort studies 28,31 . For example, Ahola-Olli et al. confirmed the association of branched-chain and aromatic amino acids with type 2 diabetes risk in a cohort of 11,896 young adults. They also identified additional biomarkers, including various fatty acids and lipoprotein lipids, linked to hyperglycemia and diabetes onset 138 . Similarly, Borges et al. demonstrated the role of certain fatty acid biomarkers in cardiovascular risk through a meta-analysis of 16,126 participants from six cohorts, revealing distinct association patterns for stroke and coronary heart disease 139 . Other studies have also identified unique metabolomic signatures for different subtypes of cardiovascular disease, including stroke, coronary artery disease, and peripheral artery disease 140,141 . These findings highlight the complexity of the metabolomic contributions to these conditions.

While most research has focused on cardiometabolic diseases, individual metabolites have also been linked to other conditions. For instance, Fischer et al.
found four metabolomic biomarkers to be associated with all-cause mortality risk in two cohorts totaling 17,345 individuals 142 . Similarly, Ritchie et al. associated the biomarker glycoprotein acetyls (GlycA) with chronic inflammation and the long-term risk of severe infection in three population-based cohorts involving 11,825 participants 143 . Kettunen et al. tested the associations of GlycA with the incidence of over 400 common diseases in a cohort of 11,861 individuals, revealing a broad spectrum of significant associations with various disease outcomes 144 . Tynkkynen et al. and van der Lee et al. identified specific metabolites linked to the risk of dementia, Alzheimer’s disease and cognitive decline 145,146 . Additionally, individual metabolites have been associated with the risk of various cancers, including breast 147 , colorectal 148 , and pancreatic 149,150 cancers. As sample sizes grew, the focus shifted from associating individual biomarkers with specific disease outcomes to appreciating the whole metabolomic profile as a broader reflection of disease risk. For instance, Deelen et al. developed a prediction model for all-cause mortality using 14 metabolites identified from metabolomic profiles in a study involving 44,168 individuals from 12 cohorts 30 . This model demonstrated improved prediction accuracy compared to standard risk factors for mortality, suggesting that metabolomic profiles can reflect broader frailty and systemic risks associated with mortality. Similarly, Pietzner et al. examined multimorbidity, the simultaneous presence of multiple chronic conditions, using MS-based metabolomics profiles from 11,966 individuals. Their findings revealed shared metabolite associations among multiple incident diseases, highlighting the systemic relevance of blood metabolites 29 . 
While the potential of metabolomic biomarker data to inform disease risks has become evident, the challenge remains in translating these findings into practical tools for risk prediction. A key step forward will be the expansion of metabolomics data within large population cohorts that are linked to comprehensive health outcome data. Prior to the research presented in this dissertation, the largest risk prediction studies in metabolomics had involved around 10,000 participants, with multicohort meta-analyses extending this to approximately 40,000. Publications II-IV of this dissertation introduce and build upon a novel NMR metabolomics dataset from the UK Biobank, comprising metabolomic profiles from over 100,000 individuals. This substantial increase in sample size provides a unique opportunity to assess the broader relevance of metabolomic biomarkers in health and disease and to develop robust risk prediction models.

Following the release of the UK Biobank metabolomics dataset and the presentation of initial findings in Publications II-III, several subsequent studies have leveraged this resource to further explore the role of metabolomics in disease risk prediction. For instance, Buergel et al. 151 trained deep learning models to predict common multi-disease outcomes based on the metabolomic profiles from UK Biobank. Their models stratified risk across a wide range of conditions and improved prediction beyond established clinical risk factors for diseases such as type 2 diabetes, dementia, and heart failure. Other prediction studies have focused on specific disease outcomes, demonstrating the role of metabolomics in the risk of type 2 diabetes 152 , dementia 153,154 , heart failure 155 , hepatocellular carcinoma and chronic liver disease 156 , Parkinson’s disease 157 and aging 158 , among others.
Collectively, these studies highlight the growing research interest in using metabolomics for risk prediction, with promising implications for precision medicine.

2.4 Treatment strategies in precision medicine

Beyond disease risk prediction, another central aspect of precision medicine relates to optimizing treatments by considering the unique variability among patients.

2.4.1 Drugs and drug targets

Drugs are chemical entities designed to alter biological processes within the body, primarily with the goal of treating, curing, or preventing diseases. The effectiveness of a drug is fundamentally linked to its ability to interact with specific molecules in the body, known as drug targets. These targets are typically proteins, such as enzymes, receptors, or ion channels, that are intrinsically linked to the pathophysiology of diseases. For instance, statins, a widely utilized class of cholesterol-lowering agents, exert their effects by inhibiting HMG-CoA reductase, an enzyme crucial to cholesterol synthesis in the liver 159 .

The specificity with which a drug engages its target is often considered crucial for its therapeutic efficacy and safety, as drugs that precisely interact with their intended targets can theoretically produce strong therapeutic effects while minimizing side effects 160 . However, the inherent complexity of human biology and disease frequently challenges this ideal. In practice, drugs with broad, multi-target effects can sometimes be more effective or safer than highly selective agents 161–163 . Consequently, multi-targeted monotherapies and drug combinations are increasingly recognized as valuable strategies for managing complex diseases, such as cancers 49,162 .

2.4.2 Drug combination treatments

Drug combination treatments involve the simultaneous use of two or more therapeutic agents to target multiple pathways or mechanisms associated with a disease.
This approach has become increasingly important in managing complex conditions like cancer and infectious diseases, where single-agent therapies often fall short due to the complex pathophysiology and adaptive nature of these diseases 49,164–166 . Drug combinations can enhance therapeutic efficacy through diverse mechanisms. For instance, in cancer treatment, combining cytotoxic agents with targeted therapies can improve outcomes by simultaneously inducing direct tumor cell death while inhibiting specific signaling pathways that contribute to cell proliferation. A classic example is the combination of chemotherapeutic agents, such as cisplatin or paclitaxel, with targeted therapies like bevacizumab, an anti-angiogenic drug. This combination approach inhibits cell division while concurrently interfering with the vascular supply of the tumour, resulting in improved survival rates for patients with specific cancer types, such as non-small cell lung cancer 167,168 . Moreover, drug combinations can help mitigate the emergence of drug resistance, a major challenge in cancer treatment 46–48 . Resistance frequently develops as cancer cells adapt to the selective pressure of single-agent therapies by activating alternative pathways to sustain growth. Combination therapies, which simultaneously target multiple pathways or mechanisms, can disrupt various aspects of cancer signaling networks, complicating the tumor cells’ ability to adapt and evade treatment. For instance, in melanoma patients with BRAF V600E mutations, combining MEK and BRAF inhibitors has been shown to reduce resistance and improve overall treatment efficacy compared to targeting either pathway alone 169 . An additional benefit of using drug combinations is the potential reduction in toxicity associated with high doses of single agents. 
By administering lower doses of each drug in combination, the overall toxicity can be minimized, which in turn can improve the overall safety profile of the treatment 170,171 .

2.4.3 Quantifying drug combination effects

The sheer number of potential drug combinations far surpasses what can be evaluated in clinical trials. Consequently, researchers often employ high-throughput screening techniques to systematically assess the effects of various drug combinations in pre-clinical models, such as established cell lines and those derived from patient samples. The primary goal of high-throughput screening is to identify drug combinations that demonstrate a synergistic effect, in which the combined effect exceeds what would be expected from the individual effects. Synergistic combinations uncovered through high-throughput screening can then be prioritized for further pre-clinical and clinical evaluation.

Cancer cell lines are essential models for studying the effects of drug combinations. Derived from various tumor types, they capture a wide range of genetic and phenotypic diversity, allowing researchers to assess how different drug combinations perform across multiple cancer types and genetic backgrounds. While these cultured cells may not fully replicate the complexity of tumors in patients, they offer a valuable experimental platform for investigating cancer biology and evaluating the therapeutic efficacy of anticancer compounds in high-throughput experiments 172–174 .

In the high-throughput screening of drug combinations, researchers frequently employ large-scale dose-response matrix experiments to analyze drug combination effects. In these experiments, cell lines are exposed to various concentration pairs of two drugs, along with each drug individually, to measure their responses.
The resulting dose-response matrix displays the effects for every concentration pair, which can be directly compared to the monotherapy effects along the matrix edges (Figure 2.3). The responses are typically quantified using metrics such as the percentage of cell growth inhibition, survival, or death relative to a control.

Figure 2.3. An illustrative example of a 7 × 7 dose-response matrix design used to quantify the effects of drug combinations. In this matrix, the rows and columns represent varying concentrations of two drugs, $D_1$ and $D_2$, respectively. Each entry within the matrix indicates the observed combination effect, for example measured as the percentage growth of the cell line. The monotherapy effects of each drug are represented along the edges of the matrix. By comparing the observed combination effects to the expected effects calculated using a reference model of non-interaction, the drug combinations can be categorized as a) additive, b) synergistic, or c) antagonistic based on deviations from these expected values. Reprinted and modified from the author’s Master’s thesis.

Analyzing data from dose-response matrices allows researchers to categorize the effects of drug combinations as synergistic, antagonistic, or additive. This classification is based on how the observed response to the combination deviates from the expected response, which is calculated using a reference model of non-interaction. Synergism occurs when the observed effect of the drug combination exceeds the expected effect derived from responses to the individual drugs, while antagonism refers to a combination effect that is less than expected. Commonly used models for quantifying drug combination synergy include the highest single agent (HSA) model 175 , the Bliss independence model 176 , and the Loewe additivity model 177 , each differing in their underlying assumptions.
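As a rough illustration of how a reference model turns monotherapy responses into expected combination effects, the sketch below computes HSA and Bliss expectations on a small dose grid and subtracts them from the observed matrix. It assumes that responses are expressed as fractional inhibition between 0 and 1 rather than as percentage growth, and that the first row and column hold the monotherapy effects. The function names and all numbers are hypothetical. The Loewe additivity model is left out because it requires fitted single-drug dose-response curves.

```python
import numpy as np


def expected_effect(mono_1, mono_2, model="bliss"):
    """Expected inhibition for every dose pair under a non-interaction model.

    mono_1, mono_2: fractional inhibition (0 = no effect, 1 = full effect)
    of each drug alone at its tested concentrations (assumed data layout).
    Returns a matrix with rows indexed by doses of drug 1 and columns by
    doses of drug 2.
    """
    e1 = np.asarray(mono_1, dtype=float)[:, None]
    e2 = np.asarray(mono_2, dtype=float)[None, :]
    if model == "hsa":
        # Highest single agent: the larger of the two monotherapy effects.
        return np.maximum(e1, e2)
    if model == "bliss":
        # Bliss independence: the two drugs act as independent events.
        return e1 + e2 - e1 * e2
    raise ValueError(f"unknown reference model: {model}")


def synergy_scores(observed, mono_1, mono_2, model="bliss"):
    """Observed minus expected inhibition; positive values suggest synergy."""
    expected = expected_effect(mono_1, mono_2, model)
    return np.asarray(observed, dtype=float) - expected


# Hypothetical 3 x 3 example; zero dose is the first row and column.
mono_1 = [0.0, 0.2, 0.5]
mono_2 = [0.0, 0.1, 0.4]
observed = np.array([
    [0.0, 0.10, 0.4],
    [0.2, 0.35, 0.6],
    [0.5, 0.60, 0.8],
])
bliss_excess = synergy_scores(observed, mono_1, mono_2, model="bliss")
hsa_excess = synergy_scores(observed, mono_1, mono_2, model="hsa")
```

With these made-up numbers, the highest dose pair exceeds the Bliss expectation by about 0.1 and the HSA expectation by about 0.3, which shows how the same observation can be scored differently depending on the chosen reference model.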
2.4.4 Prior research in predicting drug combination effects

Despite advancements in high-throughput screening technologies, systematically testing every possible drug combination is impractical, as the number of potential drug and dose combinations increases exponentially with the number of considered drug components and their respective doses. This presents a particular challenge in cancer treatment, where the molecular heterogeneity of cancer necessitates highly individualized therapeutic strategies. Over the past decade, a diverse array of computational approaches has been developed to identify promising drug combinations for subsequent experimental validation 58,59,178 . This section provides an overview and examples of computational methods used for predicting drug combination effects.

Before the widespread adoption of machine learning, other computational methods were commonly used to predict drug combination effects. Such methods, which remain useful especially when training data are limited, include mathematical models, stochastic search algorithms, and systems biology approaches 178 . Mathematical models explicitly model drug combination response dynamics based on factors like transcriptomic changes induced by the drugs 179,180 . Stochastic search algorithms iteratively explore combinations to optimize synergy 181,182 . Systems biology approaches model interactions within biological networks to predict synergistic drug effects 183–185 . While these methods can offer valuable insights, their effectiveness often depends on the accuracy of underlying assumptions and the availability of detailed biological knowledge 178 .

More recently, machine learning techniques have gained popularity in drug combination prediction, driven by the increasing availability of data from high-throughput drug combination screens. These methods vary in how they approach the prediction task and the types of input features and training data used.
Most models are based on supervised learning, where the algorithm learns to predict outcomes from labeled training data. In this context, the problem is typically framed as either a classification or regression task. Classification models aim to categorize drug combinations as additive, synergistic, or antagonistic, while regression models aim to predict continuous outcomes, such as synergy scores or dose-response values. Input features in these models range from drug-specific characteristics like chemical structures (e.g., molecular fingerprints or graph-based features) and physicochemical properties, to biological data from the target system, such as gene expression profiles or copy number variations in cancer cell lines. Training data usually comes from large-scale experimental drug combination screens, with widely used datasets including NCI-ALMANAC 53 , DrugComb 186 , SYNERGxDB 187 , and the study by O’Neil et al. 188 .

Various machine learning algorithms have been employed for drug combination prediction, including support vector machines 189 , random forests 190,191 , extreme gradient boosting machines (XGBoost) 192,193 and logistic regression 194 . Among these, popular ensemble tree-based models, such as random forests and XGBoost, have demonstrated strong predictive performance and are frequently applied in this context. For example, Sidorov et al. applied random forests and XGBoost to predict drug combination synergy. Their study developed separate models for each cell line, training the models to predict synergy scores from a large publicly available database based on various chemical features 192 . Random forests were also used in a study by Li et al. 190 , which incorporated a variety of features such as biomarkers and genetic data, including gene expression, copy number variation and methylation data, to predict drug combination synergy. Celebi et al. also used XGBoost to predict drug combination synergies using multi-omics features 193 .
However, similar to neural networks, such models can be computationally intensive to train and challenging to interpret.

Deep learning, a field of machine learning that uses multi-layered neural networks to capture complex patterns in data, has become increasingly popular for predicting drug combinations in recent years 195–201 . For instance, a study by Preuer et al. was among the first to apply deep learning to this task, utilizing both compound information and genomic data from cancer cell lines to predict drug combination synergy 202 . However, despite its potential, some studies have shown that deep learning does not consistently outperform other methods in predictive accuracy and can sometimes struggle with generalization to unseen drugs 58,203–205 .

To objectively evaluate and advance computational methods for drug combination prediction, two crowdsourced DREAM (Dialogue for Reverse Engineering Assessment and Methods) challenges have been conducted 206,207 . The latest, the AstraZeneca-Sanger Drug Combination DREAM Challenge, evaluated 160 prediction methods on their ability to predict and classify drug synergy using blinded data from 910 drug combinations across 85 cancer cell lines 206 . The top-performing method utilized a random forest algorithm, integrating monotherapy data, drug target information, molecular features, and gene–gene interaction networks to predict synergy 190 . The results from these challenges have shown that the selective incorporation of biological knowledge, rather than the choice of computational method alone, is key to achieving high predictive performance 206,208 .

Despite the notable advancements in machine learning approaches for drug combination prediction, at the time of Publication I, a key limitation that remained largely unaddressed was the dose-dependence of drug combination effects.
Most available methods at the time focused on predicting a single synergy score or categorizing combinations as synergistic, additive, or antagonistic, without accounting for how these effects vary with different doses. In practice, drug combinations can exhibit different effects across dose levels; a combination that is synergistic at one dose may be antagonistic at another. This highlighted the need for computational models capable of providing detailed, dose-specific predictions.

2.5 Machine learning and statistical inference

Machine learning and statistical inference represent two related domains within data science, both dedicated to deriving meaningful insights from data. However, they differ in their primary objectives. Statistical inference focuses on drawing conclusions about a population from sample data by estimating model parameters and testing hypotheses, while machine learning centers on prediction and developing models that learn from data to generalize to new, unseen instances.

I will use the following notation throughout this dissertation. Each observation consists of a vector of predictor variables $\mathbf{x} \in \mathbb{R}^d$ (also known as features or covariates) and an outcome variable $y$. The outcome can be continuous, $y \in \mathbb{R}$ (regression), binary, $y \in \{0, 1\}$ (classification), or time-to-event, $y = \{T \in \mathbb{R}_+, \delta \in \{0, 1\}\}$ (survival analysis). The task is to estimate a function $\hat{y}$ to make a good prediction of the output $y$, given the predictors $\mathbf{x}$. To construct the prediction rules in a supervised manner, we use a set of measurements $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, commonly referred to as the training data, or simply as the sample in the context of statistical inference. Depending on the specific goals of the analysis, the learned prediction rules can be used for making inferences about the underlying data-generating process or to make predictions on unseen instances.
Tensors are represented using bold calligraphic letters $\boldsymbol{\mathcal{X}}$, matrices using bold uppercase letters $\mathbf{X}$, vectors using bold lowercase letters $\mathbf{x}$, and scalars using regular lowercase letters $x$. The $i$th row of a matrix $\mathbf{X}$ is denoted as $\mathbf{x}_i$. All vectors are considered as column vectors; thus, for a vector $\mathbf{a} \in \mathbb{R}^n$, $\mathbf{a}^\top \mathbf{a}$ is a scalar and $\mathbf{a}\mathbf{a}^\top$ is an $n \times n$ matrix, where $\top$ denotes the transpose operator. An inner product between two vectors $\mathbf{a}$ and $\mathbf{b}$ is expressed as $\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a}^\top \mathbf{b}$. I use the term interaction to refer to a statistical interaction term between two predictor variables, as commonly used in biomedical and epidemiological research. When referring to interactions involving three or more variables, I specifically refer to these as higher-order interactions or by their order, such as a third-order interaction for interactions involving three variables.

2.5.1 Multiple regression

Multiple regression encompasses a set of statistical techniques used to predict an outcome, or dependent variable $y$, from several independent or predictor variables $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_d)$. Depending on the type of the outcome variable, commonly applied multiple regression techniques include linear regression (for continuous outcomes), logistic regression (for binary outcomes), and Cox proportional hazards regression (for time-to-event outcomes).

These regression methods can be used for both statistical inference and prediction. When using these methods for statistical inference, the focus is on understanding the magnitude and direction of associations between the predictors and the outcome. In contrast, when using them as machine learning tools for building prediction models, the focus is on maximizing the accuracy of the model in predicting the outcome for new instances. Often, both objectives can be pursued simultaneously, allowing for the development of prediction models that also provide insights into the factors underlying the prediction.
The most common multiple regression methods are based on a linear model, which assumes that the outcome is related to the predictors through a linear combination:

$$g(\hat{y}(\mathbf{x})) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_d x_d \tag{2.3}$$

where $g(\cdot)$ represents a link function that connects the linear predictor to the outcome. The regression coefficients $\beta_i$ quantify the influence of each predictor on the outcome and $\beta_0$ is an intercept term, also known as the bias term. The choice of link function varies depending on the type of regression model (Table 2.1).

Table 2.1. Variations of multiple regression models.

Method | Link function
Linear regression | $\hat{y}(\mathbf{x}) = \boldsymbol{\beta}^\top \mathbf{x}$
Logistic regression | $\log\left(\frac{P(y = 1 \mid \mathbf{x})}{1 - P(y = 1 \mid \mathbf{x})}\right) = \boldsymbol{\beta}^\top \mathbf{x}$
Cox proportional hazards regression | $\log(h(t \mid \mathbf{x})) = \log(h_0(t)) + \boldsymbol{\beta}^\top \mathbf{x}$

To express the linear regression model in vector form, it is convenient to include a constant variable 1 in the vector of predictor variables $\mathbf{x} = (1, x_1, x_2, \ldots, x_d)$ to account for the intercept term. The linear model can then be compactly written as an inner product:

$$g(\hat{y}(\mathbf{x})) = \boldsymbol{\beta}^\top \mathbf{x} \tag{2.4}$$

where $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_d)^\top$ is a vector containing the regression coefficients. The term $\boldsymbol{\beta}^\top \mathbf{x}$ is commonly referred to as the linear predictor, representing the linear combination of the predictor variables $\mathbf{x}$ weighted by their respective coefficients $\boldsymbol{\beta}$. This vector notation will be used throughout the following sections.

In the field of statistical epidemiology and disease risk modeling, multiple regression techniques are widely recognized as standard methods due to their interpretability and ability to provide insights into the relationship between predictors and outcomes. In Publications II-IV of this dissertation, these models were employed to quantify associations between metabolomic biomarkers and disease risk, as well as to derive risk prediction models.
Given the novelty of the UK Biobank metabolomics dataset, it was important to use these established regression techniques to provide a reliable reference point for future studies and ensure that the results could be meaningfully compared to existing research in the field.

Logistic regression

Logistic regression is a widely used form of multiple regression for modeling binary outcome variables $y \in \{0, 1\}$. In statistical epidemiology, it is commonly used for modelling outcomes where time is not relevant, such as prevalent disease outcomes. Logistic regression uses the logistic function $\sigma : \mathbb{R} \to (0, 1)$, $\sigma(t) = \frac{1}{1 + e^{-t}}$, to convert a linear combination of input predictors into a probability ranging from 0 to 1:

$$\hat{y}(\mathbf{x}) = P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^\top \mathbf{x}}} \tag{2.5}$$

where $\mathbf{x}$ is the vector of predictor variables, including a constant term for the intercept, and $\boldsymbol{\beta}$ is the vector of regression coefficients. Logistic regression establishes a linear relationship between the predictors and the log odds of the outcome through the logit function, which is the inverse of the logistic function:

$$\operatorname{logit}(\hat{y}(\mathbf{x})) = \log\left(\frac{\hat{y}(\mathbf{x})}{1 - \hat{y}(\mathbf{x})}\right) = \boldsymbol{\beta}^\top \mathbf{x} \tag{2.6}$$

Thus, logistic regression assumes that while the outcome itself is binary, the log odds of the outcome (i.e., the logit) has a linear relationship with the predictor variables.

The parameters $\boldsymbol{\beta}$ in logistic regression are estimated using maximum likelihood estimation (MLE), which seeks the parameter values that maximize the likelihood of observing the given data under the logistic regression model. As there is no closed-form solution, the maximization is carried out numerically. The likelihood function $L(\boldsymbol{\beta})$ is given by:

$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \hat{y}_i(\mathbf{x})^{y_i} \left(1 - \hat{y}_i(\mathbf{x})\right)^{1 - y_i} \tag{2.7}$$

where $\hat{y}_i(\mathbf{x})$ is the predicted probability for individual or observation $i$ (eq. (2.5)).
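To make eqs. (2.5) and (2.7) concrete, the following minimal sketch evaluates the predicted probabilities and the likelihood for two candidate coefficient vectors. The data, coefficient values and function names are hypothetical and chosen only for illustration; the first column of `X` is assumed to be the constant 1 for the intercept.

```python
import numpy as np


def predict_proba(X, beta):
    """Eq. (2.5): P(y = 1 | x) for each row of X (first column is the constant 1)."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))


def likelihood(X, y, beta):
    """Eq. (2.7): product over observations of p^y * (1 - p)^(1 - y)."""
    p = predict_proba(X, beta)
    return float(np.prod(p ** y * (1.0 - p) ** (1 - y)))


# Hypothetical data: three observations, intercept plus one predictor.
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1, 0, 1])

lik_null = likelihood(X, y, np.array([0.0, 0.0]))   # every probability is 0.5
lik_slope = likelihood(X, y, np.array([0.0, 1.0]))  # positive slope on the predictor
```

Here the all-zero coefficient vector gives a likelihood of $0.5^3 = 0.125$, while the vector with a positive slope yields a larger value because it assigns higher probability to the observed outcomes. For large $n$, a product of many probabilities underflows in floating-point arithmetic, so a direct implementation like this one is only suitable for small examples.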
The log-likelihood function, which provides an equivalent but more convenient form for optimization, is obtained by taking the logarithm of the likelihood function:

$$\ell(\boldsymbol{\beta}) = \log(L(\boldsymbol{\beta})) = \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i(\mathbf{x})) + (1 - y_i) \log(1 - \hat{y}_i(\mathbf{x})) \right] \tag{2.8}$$

The maximum likelihood estimates of $\boldsymbol{\beta}$ are obtained by maximizing this log-likelihood function, typically using optimization methods such as Newton-Raphson or gradient descent.

Cox proportional hazards regression

Cox proportional hazards regression 32 , or simply Cox regression, is a widely used method in survival analysis for modeling time-to-event outcomes, such as the onset of a disease. The outcome in Cox regression consists of the time to event $T \in \mathbb{R}_+$ and an event indicator $\delta \in \{0, 1\}$, where $\delta = 1$ indicates the event has occurred, and $\delta = 0$ indicates censoring, i.e. the event has not occurred by the end of the observation period. Unlike logistic regression, which models the probability of an event, Cox regression models the hazard function $h(t \mid \mathbf{x})$, representing the instantaneous risk of the event at time $t$, conditional on survival up to that time. The Cox model is expressed as:

$$h(t \mid \mathbf{x}) = h_0(t) \exp(\boldsymbol{\beta}^\top \mathbf{x}) \tag{2.9}$$

where $h_0(t)$ is the baseline hazard function, which varies over time but is independent of the predictors, and $\exp(\boldsymbol{\beta}^\top \mathbf{x})$ is the partial hazard reflecting the effect of the predictor variables on the baseline hazard. This formulation allows the model to separate the time-dependent baseline hazard from the influence of the predictors. The term "proportional hazards" reflects the assumption that the predictors affect the hazard rate in a multiplicative manner, maintaining proportionality over time across individuals. Taking the logarithm of the hazard function reveals the linear relationship of the predictor variables:

$$\log(h(t \mid \mathbf{x})) = \log(h_0(t)) + \boldsymbol{\beta}^\top \mathbf{x} \tag{2.10}$$

The parameters $\boldsymbol{\beta}$ are estimated using partial likelihood, as the full likelihood is complicated by the baseline hazard function $h_0(t)$, which typically remains unspecified. The partial likelihood focuses on the ordering of events rather than their exact timing, allowing for the estimation of the coefficients in a semi-parametric fashion without the need to specify the baseline hazard $h_0(t)$:

$$L(\boldsymbol{\beta}) = \prod_{i : \delta_i = 1} \frac{h_0(T_i) \exp(\boldsymbol{\beta}^\top \mathbf{x}_i)}{\sum_{j \in R(T_i)} h_0(T_i) \exp(\boldsymbol{\beta}^\top \mathbf{x}_j)} = \prod_{i : \delta_i = 1} \frac{\exp(\boldsymbol{\beta}^\top \mathbf{x}_i)}{\sum_{j \in R(T_i)} \exp(\boldsymbol{\beta}^\top \mathbf{x}_j)} \tag{2.11}$$

where $\mathbf{x}_i$ is the vector of predictor variables for individual $i$, $T_i$ is their observed event time, and $R(T_i)$ denotes the risk set at time $T_i$, consisting of individuals who are still at risk before the $i$th event. To obtain a more convenient formulation, the log-partial likelihood function is often used:

$$\ell(\boldsymbol{\beta}) = \log(L(\boldsymbol{\beta})) = \sum_{i : \delta_i = 1} \left( \boldsymbol{\beta}^\top \mathbf{x}_i - \log\left( \sum_{j \in R(T_i)} \exp(\boldsymbol{\beta}^\top \mathbf{x}_j) \right) \right) \tag{2.12}$$

The maximum partial likelihood estimation approach involves finding the parameter values $\boldsymbol{\beta}$ that maximize this log-partial likelihood function. As no closed-form solution exists, the optimization is typically performed using methods such as Newton-Raphson or gradient descent.

Statistical inference and interpretation

In both logistic and Cox regression, the estimated coefficients $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_d)^\top$ can be used for statistical inference and interpretation:

• Logistic regression: The coefficients $\beta_i$ reflect the change in the log odds of the outcome associated with a one-unit increase in the predictor, assuming all other variables remain constant. When exponentiated, $e^{\beta_i}$ represents the odds ratio, indicating the multiplicative change in the odds of the outcome for each one-unit increase in the predictor.

• Cox proportional hazards regression: The coefficients $\beta_i$ reflect the change in the log hazard associated with a one-unit increase in the predictor variable, assuming all other variables remain constant. When exponentiated, $e^{\beta_i}$ represents the hazard ratio, indicating the multiplicative change in the hazard for each one-unit increase in the predictor.

To test the null hypothesis $H_0 : \beta_i = 0$ (indicating no effect of the predictor $x_i$ on the outcome), the Wald test 209 is typically used in both logistic and Cox regression. The Wald statistic $W$ is calculated as:

$$W = \left( \frac{\hat{\beta}_i}{\operatorname{se}(\hat{\beta}_i)} \right)^2 \tag{2.13}$$

where $\hat{\beta}_i$ is the estimated coefficient and $\operatorname{se}(\hat{\beta}_i)$ is its standard error. Under the null hypothesis, $W$ follows a $\chi^2$ distribution with 1 degree of freedom. For practical inference, the square root of $W$ is typically used to obtain a z-statistic, which approximately follows a standard normal distribution under the null hypothesis. The $100(1 - \alpha)\%$ confidence interval for each estimated regression coefficient is given by:

$$\hat{\beta}_i \pm z_{\alpha/2} \cdot \operatorname{se}(\hat{\beta}_i) \tag{2.14}$$

where $z_{\alpha/2}$ is the critical value from the standard normal distribution. This facilitates hypothesis testing for the significance of predictors and interpreting their impact on the outcome.

When interpreting the association results, potential sources of bias such as confounding, information bias, and selection bias must be considered 210,211 . Confounding arises when the association between a predictor and an outcome is distorted by a third variable, a confounder, which is associated with both the predictor and the outcome. Confounding can often be corrected for by including the confounder as a covariate in the model or through stratification. Information bias results from measurement errors in predictor or outcome variables, while selection bias can occur if the study sample is not representative of the target population. Such biases may lead to inaccurate estimates and incorrect conclusions about the true effect sizes. However, when the goal is to discover predictive associations rather than to make causal inferences, these biases are less critical.
In this context, the presence of confounders or other biases does not necessarily undermine the model’s utility, provided it reliably predicts the outcome of interest [212].

Regularization

When using multiple regression models for prediction, regularization is a key technique for addressing the risk of overfitting. Overfitting arises when a model is excessively complex, capturing noise in the training data rather than the underlying patterns, which can hinder its performance on new data. Regularization addresses this issue by adding a penalty term to the objective function, constraining the magnitude of the regression coefficients $\beta$. This results in simpler, more parsimonious models that generalize better to unseen data.

In the context of logistic and Cox proportional hazards regression, estimation of the regression coefficients $\beta$ involves maximizing the log-likelihood or log-partial likelihood functions, respectively. This is equivalent to minimizing the negative log-(partial) likelihood $-l(\beta)$. Regularization modifies this objective function by adding a penalty term $P(\beta)$, which imposes a constraint on the magnitude of the coefficients:

\[
\hat{\beta} = \arg\min_{\beta} \; -l(\beta) + \lambda P(\beta) \tag{2.15}
\]

where $\lambda \geq 0$ is a regularization parameter that determines the strength of the penalty. The penalty term can take different forms depending on the type of regularization. For instance, ridge regression [213] uses the L2 regularizer $P(\beta) = \|\beta\|_2^2 = \sum_{i=1}^{d} \beta_i^2$, while LASSO [214] (least absolute shrinkage and selection operator) regression uses the L1 regularizer $P(\beta) = \|\beta\|_1 = \sum_{i=1}^{d} |\beta_i|$. The regularization parameter $\lambda$ controls the trade-off between fitting the model to the training data and imposing a penalty on the complexity of the model. When $\lambda = 0$, the model reduces to the unregularized form, where the penalty has no influence.
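The penalized objective in Eq. 2.15 can be illustrated with a minimal sketch. It assumes a logistic model without an intercept, synthetic data, and plain gradient descent; the negative log-likelihood is averaged over samples, which only rescales $\lambda$. All function and variable names are hypothetical and chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalized_objective(beta, X, y, lam, penalty="l2"):
    """Mean negative logistic log-likelihood plus lam * P(beta), cf. Eq. 2.15."""
    z = X @ beta
    # log(1 + exp(z)) evaluated stably via logaddexp
    neg_loglik = np.mean(np.logaddexp(0.0, z) - y * z)
    if penalty == "l2":
        p = np.sum(beta ** 2)       # ridge: squared L2 norm
    else:
        p = np.sum(np.abs(beta))    # LASSO: L1 norm
    return neg_loglik + lam * p

def fit_ridge_logistic(X, y, lam, lr=0.1, n_iter=2000):
    """Plain gradient descent on the ridge-penalized objective."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / len(y) + 2.0 * lam * beta
        beta -= lr * grad
    return beta

# Synthetic data (an assumption for illustration, not data from the thesis)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.5, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(200) < sigmoid(X @ beta_true)).astype(float)

# Norm of the fitted coefficient vector for increasing penalty strength
lambdas = [0.0, 0.1, 1.0]
norms = [float(np.linalg.norm(fit_ridge_logistic(X, y, lam))) for lam in lambdas]
print(dict(zip(lambdas, norms)))
```

The printed norms show how the fitted coefficients shrink towards zero as the penalty strength grows. The L1 branch is included only in the objective, since LASSO requires a non-smooth optimizer such as coordinate descent rather than the plain gradient step used here.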
As $\lambda$ increases, the penalty term exerts a stronger influence, resulting in greater shrinkage of the coefficients, which can reduce variance but potentially increase bias. This is commonly referred to as the bias-variance trade-off.

Interactions

Interactions in statistical modelling occur when the effect of one predictor on the outcome depends on the level of another predictor (Figure 2.4). This implies that the predictors jointly influence the outcome in a way that is not simply additive. Including interaction terms in a regression model allows for the exploration of more complex relationships and can provide insights into how different factors jointly affect the outcome. For instance, the biological processes giving rise to disease are often inherently complex, with interactions representing one form of this complexity.

Recognizing and appropriately modeling interactions can enhance both the predictive accuracy and interpretability of a regression model. In predictive modeling, accounting for interactions can lead to more accurate predictions by capturing complex relationships between predictors that would otherwise be missed if only main effects were considered. For example, many machine learning models, such as random forests or neural networks, capture interactions between variables implicitly. In linear models, however, interactions need to be explicitly specified.

To model interactions in a regression model, an interaction term is created by multiplying the two predictors of interest. The regression equation then includes this interaction term in addition to the main effects of the individual predictors:

\[
g(\cdot) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + ... \tag{2.16}
\]

where $g(\cdot)$ is a link function appropriate for the model (e.g. logit for logistic regression, identity for linear regression) and $\beta_0$ is the intercept. $x_1$ and $x_2$ are the predictor variables of interest, with $\beta_1$ and $\beta_2$ giving their respective main effects on the outcome. $x_1 x_2$ is the interaction term and $\beta_{12}$ represents the effect of the interaction on the outcome. The model can be extended to include additional predictors and their interactions.

[Figure 2.4 panels: a) No interaction; b) Positive interaction; c) Negative interaction. Each panel plots y against x1, with x2 color-coded.]

Figure 2.4. Illustrative examples of different data scenarios depicting interaction effects in the context of a continuous outcome variable y (y-axis). a) The continuous predictor x1 (x-axis) and binary predictor x2 (color-coded) both exhibit positive main effects on y, but no interaction effect is present. b) Similar to a), but with an additional positive interaction effect between x1 and x2 on y. c) Similar to a), but with an additional negative interaction effect between x1 and x2 on y.

Modeling interactions becomes increasingly complicated as the number of predictors increases, posing a particular challenge in prediction models involving many variables. As the number of possible pairwise interaction terms increases quadratically with the number of predictors, including all potential interactions quickly becomes both computationally and statistically challenging. This challenge motivated the work in Publication V, where we developed an extension to the Cox proportional hazards regression model to effectively model interactions among predictor variables. This approach was developed based on concepts derived from factorization machines, a machine learning technique presented in the following section.

2.5.2 Factorization machines

Factorization machines (FMs) provide a non-linear extension to standard multiple regression techniques that can capture comprehensive interactions among predictor variables. They were introduced by Steffen Rendle in 2010 [215] as a method to generalize matrix factorization and regression models by efficiently modeling interactions in high-dimensional datasets.
Initially, FMs were applied to recommender systems and click-through rate prediction, effectively modeling the sparse and high-dimensional data common in these domains [215–219].

Standard formulation of factorization machines

The central idea behind FMs is to model the interactions between pairs of variables using inner products of low-dimensional latent vectors (Figure 2.5a). In this framework, each predictor variable is associated with a corresponding latent vector. The interaction between any two variables is then represented by the inner product of their latent vectors. This approach enables FMs to capture complex interaction patterns without explicitly enumerating all possible interactions, which would be computationally and statistically challenging in high-dimensional settings. Mathematically, the FM model can be expressed as:

\[
\hat{y}(x) = \beta^\top x + \sum_{1 \leq i \neq j \leq d} \langle p_i, p_j \rangle x_i x_j \tag{2.17}
\]

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The first term represents the linear effects of the predictor variables, analogous to standard multiple regression models, where $\beta$ contains the regression coefficients for the predictors $x$. The second term accounts for all potential pairwise interactions between the predictor variables $x_i$ and $x_j$. Instead of estimating these interaction effects $\beta_{ij}$ directly, FMs approximate them using the inner product of two latent vectors $p_i$ and $p_j$:

\[
\beta_{ij} \approx \langle p_i, p_j \rangle = \sum_{f=1}^{k} p_{if} \cdot p_{jf} \tag{2.18}
\]

where $p_i \in \mathbb{R}^k$ represents the contribution of variable $x_i$ to $k$ latent factors. The collection of these latent vectors forms the parameter matrix $P \in \mathbb{R}^{d \times k}$. The rank $k$ is a hyperparameter that defines the dimensionality of the latent factor space, typically chosen such that $k \ll d$, thereby reducing the number of interaction parameters to estimate from $d^2$ to $dk$ (Figure 2.5a).

This methodology has two key advantages. First, the number of interaction parameters, which would be quadratic in the number of variables if modeled explicitly, grows only linearly in the number of variables due to the use of low-dimensional latent vectors. This enables the estimation of interaction effects even in scenarios where explicit modeling would be infeasible due to the quadratic increase of potential interactions. Second, the methodology allows for reliable parameter estimation even in cases of sparse data. The co-occurrence of predictors $x_i$ and $x_j$ does not need to be directly observed to learn the interaction effect $\beta_{ij}$; the latent vectors $p_i$ and $p_j$ can be learned through interactions with other predictors, and their inner product still gives $\beta_{ij}$. The latter feature proves particularly valuable in predicting drug combination responses in Publication I, where it facilitates inferences about the effects of new drug combinations based on prior observations of their individual components elsewhere in the training data.

FMs are applicable to a wide range of prediction tasks, including both regression and classification. The choice of the loss function, which quantifies the error between predicted and actual values, depends on the specific task at hand. Given a loss function $L$ and L2 (ridge) regularization, the regularized objective function for FMs can be expressed as:

\[
\arg\min_{\beta, P} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}(x_i)) + \lambda_1 \|\beta\|_2^2 + \lambda_2 \|P\|_2^2 \tag{2.19}
\]

where $\lambda_1 > 0$ and $\lambda_2 > 0$ are regularization parameters for the linear effects $\beta$ and for the matrix of factor vectors $P$ giving the effects for the interaction terms, respectively. The model parameters are typically estimated using gradient-based methods, such as stochastic gradient descent (SGD). Factorization machines were initially developed for regression and classification tasks, using loss functions such as the squared loss for regression or the logit or hinge loss for classification.
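The FM prediction in Eq. 2.17 can be sketched in a few lines. The sketch below assumes that each unordered pair of predictors is counted once ($i < j$), as in Rendle's original formulation, and uses random parameters rather than fitted ones. It compares an explicit enumeration of all pairs with the well-known sum-of-squares identity that evaluates the same interaction term in $O(dk)$ time. Function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def fm_predict_naive(x, beta, P):
    """Eq. 2.17 with every unordered pair (i < j) enumerated explicitly."""
    pairwise = sum((P[i] @ P[j]) * x[i] * x[j]
                   for i, j in combinations(range(len(x)), 2))
    return beta @ x + pairwise

def fm_predict_fast(x, beta, P):
    """Same value in O(d k) time via the sum-of-squares identity."""
    s = P.T @ x                     # sum_i p_if * x_i, one entry per factor f
    s_sq = (P ** 2).T @ (x ** 2)    # sum_i p_if^2 * x_i^2
    return beta @ x + 0.5 * np.sum(s ** 2 - s_sq)

rng = np.random.default_rng(0)
d, k = 20, 3                        # d predictors, rank-k latent vectors
beta = rng.normal(size=d)           # linear effects
P = rng.normal(size=(d, k))         # one latent vector p_i per row
x = rng.normal(size=d)

y_naive = fm_predict_naive(x, beta, P)
y_fast = fm_predict_fast(x, beta, P)

n_explicit = d * (d - 1) // 2       # separate weights for all unordered pairs
n_factorized = d * k                # entries of P
print(y_naive, y_fast, n_explicit, n_factorized)
```

Because every interaction weight is tied to the rows of $P$, updating $p_i$ from any observed pair involving $x_i$ also changes the implied weight for pairs that were never observed together, which is the sparsity property described above.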
In Publication V, we extend factorization machines to the context of survival analysis and show their ability to improve disease risk prediction compared to the standard linear Cox proportional hazards regression model.

Higher-order factorization machines

Higher-order factorization machines (HOFMs) extend the standard factorization machines (FMs) by capturing interactions beyond pairwise relationships [62,215]. While standard FMs are limited to modeling individual effects and pairwise interactions among predictor variables, HOFMs are designed to also account for higher-order interactions involving three or more predictors.

The core idea of HOFMs is to decompose interactions among three or more variables into products of their respective latent vectors, similar to the approach used in standard FMs for pairwise interactions (Figure 2.5b). Mathematically, the model for an HOFM of order $m$ can be expressed as:

\[
\hat{y}(x) = \beta^\top x + \sum_{1 \leq i \neq j \leq d} \langle p_i^{(2)}, p_j^{(2)} \rangle x_i x_j + ... + \sum_{1 \leq i_1 < ... < i_m \leq d} \langle p_{i_1}^{(m)}, p_{i_2}^{(m)}, ..., p_{i_m}^{(m)} \rangle x_{i_1} x_{i_2} ... x_{i_m} \tag{2.20}
\]

where the first term corresponds to the linear effects of the predictor variables, analogous to standard multiple regression models, where $\beta$ contains the regression coefficients for the predictors $x$. The second and higher-order interaction effects are estimated in a factorized form:

\[
\beta_{ij} \approx \langle p_i, p_j \rangle \tag{2.21}
\]

\[
\beta_{i_1, i_2, ..., i_t} \approx \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, ..., p_{i_t}^{(t)} \rangle \tag{2.22}
\]

where $p_i^{(t)} \in \mathbb{R}^{k_t}$ denotes the $t$th order factor weight of feature $i$ and $k_t \in \mathbb{N}_+$ is the hyperparameter defining the rank of the factorization. The generalized inner product $\langle \cdot, ..., \cdot \rangle$ over $t$ vectors $u_i \in \mathbb{R}^k$, $i = 1, ..., t$ is defined as:

\[
\langle u_1, u_2, ..., u_t \rangle = \sum_{f=1}^{k} u_{1f} u_{2f} ... u_{tf} \tag{2.23}
\]

which generalizes the usual pairwise inner product $\langle u, v \rangle = u^\top v$ to sets of $t$ vectors. The latent vectors for each interaction order $t$ are collected into matrices $P^{(t)} \in \mathbb{R}^{d \times k_t}$.
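The generalized inner product in Eq. 2.23 and the resulting HOFM prediction in Eq. 2.20 can be sketched by brute force. The sketch assumes that every interaction order sums over strictly increasing index tuples, uses random rather than fitted parameters, and enumerates all index combinations explicitly. It is therefore meant only to make the definitions concrete and is not the dynamic programming algorithm of Blondel et al. Function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def generalized_inner_product(vectors):
    """Eq. 2.23: sum over factors f of the product u_1f * u_2f * ... * u_tf."""
    return float(np.sum(np.prod(np.stack(vectors), axis=0)))

def hofm_predict_naive(x, beta, P_by_order):
    """Eq. 2.20 by brute force; P_by_order maps order t to a (d, k_t) matrix."""
    y_hat = float(beta @ x)
    for t, P in P_by_order.items():
        # every strictly increasing index tuple i_1 < ... < i_t
        for idx in combinations(range(len(x)), t):
            weight = generalized_inner_product([P[i] for i in idx])
            y_hat += weight * np.prod(x[list(idx)])
    return y_hat

# Hand-checkable example: 1*3*5 + 2*4*6
u1, u2, u3 = np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])
triple = generalized_inner_product([u1, u2, u3])

rng = np.random.default_rng(0)
d, k = 5, 2
x = rng.normal(size=d)
beta = rng.normal(size=d)
P2 = rng.normal(size=(d, k))        # latent vectors for pairwise interactions
P3 = rng.normal(size=(d, k))        # latent vectors for third-order interactions
y_hat = hofm_predict_naive(x, beta, {2: P2, 3: P3})
print(triple, y_hat)
```

With only the second-order matrix supplied, the function reduces to a standard FM with each pair counted once, which provides a simple consistency check for the implementation.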
With this factorization, the number of parameters to estimate for each order $t$ reduces from $d^t$ to $dk_t$ (Figure 2.5b). While the rank $k_t \in \mathbb{N}_+$ can be set separately for each order $t$, it is often convenient in practice to use a uniform rank $k$ across all orders, $k_1 = k_2 = ... = k_m$.

Similar to standard FMs, HOFMs are applicable to various prediction tasks, including regression and classification, depending on the choice of the loss function $L$. The L2-regularized objective function for HOFMs can be expressed as:

\[
\arg\min_{\beta, P^{(1)}, P^{(2)}, ..., P^{(m)}} \; \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}(x_i)) + \lambda_1 \|\beta\|_2^2 + \sum_{t=2}^{m} \lambda_t \|P^{(t)}\|_2^2 \tag{2.24}
\]

where $\lambda_1, \lambda_2, ..., \lambda_m > 0$ are regularization parameters.

Although the higher-order extension was already presented conceptually in the original work by Rendle [215], efficient estimation of HOFMs became feasible only with the dynamic programming approach proposed by Blondel et al. [62]. This approach allowed for the efficient calculation of interaction terms and their gradients, thus enabling the use of stochastic gradient algorithms for training HOFMs and notably improving their computational feasibility.

Factorization machines and their higher-order extensions have demonstrated state-of-the-art performance in fields such as recommender systems and click-through rate prediction for digital advertising [216–219]. However, their application in other domains, particularly in biomedical research, has been limited. In Publication I, we apply higher-order factorization machines to model complex interactions in inherently high-dimensional drug combination dose-response data and show their ability to perform accurate predictions in various prediction scenarios.

[Figure 2.5 panels: a) pairwise interaction weights $\beta_{i,j} \approx \langle p_i, p_j \rangle$ with $P \in \mathbb{R}^{d \times k}$; b) third-order interaction weights $\beta_{i_1,i_2,i_3} \approx \langle p_{i_1}^{(3)}, p_{i_2}^{(3)}, p_{i_3}^{(3)} \rangle$ with $P^{(3)} \in \mathbb{R}^{d \times k}$.]

Figure 2.5.
a) Standard factorization machines estimate all pairwise interaction effects among predictors using the factorized parametrization $\beta_{ij} \approx \langle p_i, p_j \rangle$. b) Higher-order factorization machines extend the factorization to interactions of order $t$, $\beta_{i_1,i_2,...,i_t} \approx \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, ..., p_{i_t}^{(t)} \rangle$ (here illustrated for $t = 3$). $d$ denotes the number of predictors and $k$ is a hyperparameter defining the rank of the factorization. The rank of the factorization is typically much lower than the number of predictor variables ($k \ll d$), enabling estimation of the interaction effects even in the presence of many predictors.

3. Predictive modelling of drug combination effects (Publication I)

Given the vast number of possible drug and dose combinations, computational methods are essential for guiding experimental efforts toward the most promising candidates for further pre-clinical and clinical validation. Although high-throughput screening has produced extensive datasets of drug combination responses, notable gaps persist in the combinatorial drug space. In Publication I, we introduced comboFM, a novel machine learning framework that models the dose-specific effects of drug combinations. Unlike previously proposed machine learning approaches that primarily focused on synergy, comboFM provides detailed predictions of drug responses across a range of doses.

3.1 Foundations of comboFM

comboFM builds upon representing the dose-specific drug combination response data as a higher-order tensor $\mathcal{X}$, indexed by drugs, their concentrations, and cell lines (Figure 3.1). Additionally, comboFM allows for the integration of auxiliary data such as chemical features of the drugs, or genomic or transcriptomic features of the cell lines.

[Figure 3.1 labels: cell lines; drug 1 and drug 2; drug 1 and drug 2 concentrations; chemical descriptors (e.g. fingerprint); genomic descriptors (e.g. gene expression).]

Figure 3.1. Dose–response matrices from experimental measurements are arranged into a fifth-order tensor $\mathcal{X}$, indexed by drugs, concentrations, and cell lines, along with additional descriptors. Reprinted from Publication I.

comboFM employs higher-order factorization machines (HOFMs) [62] to model interactions within the tensor. The data tensor is flattened into a matrix $X$, with each row vector $x$ representing an entry from the tensor. The method then estimates regression weights $\beta$ for the individual predictors, along with weights for each higher-order feature interaction using the factorized parameterization $\beta_{i_1,i_2,...,i_t} \approx \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, ..., p_{i_t}^{(t)} \rangle$ (Section 2.5.2). This approach avoids the computational and statistical difficulties related to direct estimation of the weight tensor $\mathcal{B}$ for the higher-order interactions (Figure 2.5b). Moreover, by coupling weights through latent factors, the approach enables effective learning even when the data tensor is sparsely populated.

Predicting the entire dose–response matrices first allows one to utilize the detailed data they contain. Subsequently, drug combination synergy can be quantified over the full predicted matrix using various synergy models, providing a more comprehensive understanding of potentially synergistic effects. Furthermore, understanding the effects of drug combinations at both the dose–response and synergy levels offers valuable guidance for precision medicine efforts, as lower doses tend to be favored in clinical settings due to better tolerability.

3.2 Drug combination dataset

comboFM was trained and tested using data from a cancer cell line pharmacogenomic screen from the NCI-ALMANAC study, the largest available drug combination dataset at the time.
This dataset comprises over 5000 combinations of around 100 FDA-approved oncology drugs tested against 60 cancer cell lines in several concentration pairs. To manage computational complexity, we considered a subset of this data containing 50 randomly chosen drugs. These drugs had been screened in 617 unique combinations across all 60 cell lines, derived from 9 tissue types. This subset included 333,180 measurements of drug combination responses and 222,120 monotherapy response measurements, recorded as the percentage growth of the cell lines. As additional features for model training, we also incorporated the molecular fingerprints of the drug compounds and the gene expression profiles of the cancer cell lines.

3.3 Evaluation settings

comboFM was tested in three practical prediction scenarios. The first scenario involved predicting missing entries in partially observed dose–response matrices. The second scenario focused on predicting entirely unmeasured dose–response matrices for known drug combinations in new cell lines. The third and most difficult scenario involved predicting responses for completely new drug combinations.

To evaluate the performance of comboFM in predicting drug combination responses and to tune model hyperparameters, we implemented a 10×5 nested cross-validation procedure across all three scenarios. The interaction order was set to $m = 5$, corresponding to the dimensionality of the underlying tensor. To assess the potential advantages of incorporating higher-order interactions, we also conducted analyses with second-order FMs and first-order FMs (equivalent to ridge regression). Additionally, to further evaluate the predictive performance of comboFM, we applied random forest (RF) as another reference model. RF is a commonly used machine learning model that operates on a different learning principle.
It has previously been successfully applied to drug combination prediction, including as the winning method in the recent AstraZeneca-Sanger drug combination prediction DREAM Challenge [190,206].

[Figure 3.2 panels: a) Predicting new dose–response matrix entries; b) Predicting new dose–response matrices; c) Predicting new drug combinations. Each panel shows %-growth over the concentrations [D1] and [D2].]

Figure 3.2. Three prediction scenarios are considered: a) predicting missing entries in partially tested dose–response matrices, b) predicting a complete dose–response matrix in a new cell line, and c) making predictions for a completely new drug combination not tested so far in any cell line. Reprinted from Publication I.

3.4 Accurate predictions of drug combination effects

The 5th-order comboFM achieved the highest predictive accuracy among the compared methods in all prediction scenarios, with Pearson correlations between predicted and observed dose–responses ranging from 0.95 to 0.97 (Figure 3.3). The predictive performance remained consistent across different tissue types and drug classes, demonstrating its robustness across biological contexts (Figure 3.4). Importantly, comboFM generalized well to new drug combinations, enabling systematic prediction of dose–response matrices for previously untested combinations. This could provide practical guidance on repositioning the drugs into new combinations. Furthermore, comboFM accurately recovered synergy scores from predicted dose–response matrices in all evaluated scenarios, outperforming the compared methods.

3.5 Experimental validation of predicted drug combinations

In-house experimental laboratory validation confirmed the robustness of comboFM’s predictions, even under assay conditions different from those of the original NCI-ALMANAC dataset.
For instance, the experimental validation verified a novel synergy between the proteasome inhibitor bortezomib and the anaplastic lymphoma kinase inhibitor crizotinib in lymphoma cells, as predicted by comboFM. Additionally, comboFM accurately predicted the efficacy of combinations involving the histone deacetylase (HDAC) inhibitor romidepsin in targeting BRAF-mutant melanoma cell lines, which was also confirmed experimentally. While many of the drugs in these romidepsin-based combinations have previously been explored in other combinations for melanoma treatment [220,221], the specific combinations predicted by comboFM remain untested in melanoma and require further investigation. Each of these predicted inhibitors has shown promise individually against melanoma in pre-clinical or clinical studies, supporting their potential use in combination therapies. This validation highlights the generalizability and practical applicability of comboFM.

3.6 Conclusions

In conclusion, due to the high costs associated with experimental drug combination screening, comboFM presents a time- and cost-efficient solution for prioritizing promising combinations for further pre-clinical and clinical investigation. Its robust and accurate predictions provide a valuable tool to accelerate the development and extension of combination therapies in precision oncology. This holds potential to advance the clinical use of drug combination therapies to address drug resistance and increase treatment efficacy. Several of the combinations predicted by comboFM are already undergoing clinical trials, highlighting the method’s translational potential.
[Figure 3.3: scatter plots of predicted response (%-growth) against measured response (%-growth) for comboFM-5 and RF. Panels: a) Predicting new dose–response matrix entries; b) Predicting new dose–response matrices; c) Predicting new drug combinations. Each plot is annotated with RMSE, Pearson and Spearman correlations, a fitted regression line, and the Pearson correlation of the NCI ComboScores.]

Figure 3.3. Predictive performance of the 5th-order comboFM (comboFM-5; blue) and random forest (RF; orange), comparing the measured and predicted dose–response values across the three prediction scenarios: a) predicting new entries in dose–response matrices, b) predicting entire dose–response matrices, and c) predicting new drug combinations.
Performance metrics are reported as root mean squared error (RMSE), Pearson correlation and Spearman correlation for the drug combination response predictions, along with Pearson correlations of the synergy scores (NCI ComboScores) computed from the predicted dose–response matrices. Reprinted and modified from Publication I.

[Figure 3.4: box plots of Pearson correlation for comboFM-5, comboFM-2, comboFM-1 and RF. Panels a-c are grouped by tissue type: Breast (6), CNS (7), Colon (6), Haematological (6), Melanoma (9), NSC Lung (10), Ovarian (6), Prostate (2), Renal (8). Panels d-f are grouped by drug classes: Chemo - Chemo (226), Chemo - Other (122), Other - Other (9), Targeted - Chemo (190), Targeted - Other (32), Targeted - Targeted (38). Rows: predicting new dose–response matrix entries (a, d), predicting new dose–response matrices (b, e), predicting new drug combinations (c, f).]

Figure 3.4.
Predictive performance of the 5th-order comboFM-5, 2nd-order comboFM-2 and 1st-order comboFM-1, along with random forest (RF), evaluated across tissue types (a-c) and drug classes (d-f) under three prediction scenarios: a, d) predicting new entries in dose–response matrices; b, e) predicting entire dose–response matrices; and c, f) predicting new drug combinations. In the box plots, horizontal lines indicate the median Pearson correlation, while the lower and upper hinges correspond to the 25th and 75th percentiles. The whiskers show the highest and lowest values within 1.5 times the interquartile range (IQR), and points outside the whiskers are outlier predictions. Reprinted from Publication I.

4. Predictive modelling of disease risks using metabolomic biomarkers (Publications II-IV)

Publications II-IV collectively aimed to investigate the role of NMR metabolomic biomarkers in predicting disease risks. These studies leveraged a uniquely large, novel metabolomics dataset from the UK Biobank, the largest of its kind to date. The scale of this dataset, combined with the extensive range of health outcome data, facilitated a comprehensive evaluation of the associations between individual metabolomic biomarkers and disease risk, offering valuable new insights into the broader applicability of these biomarkers for predicting the future onset of various health outcomes.

4.1 Datasets

Publications II-IV used NMR metabolomic data from three extensive prospective cohort studies: the UK Biobank, the Estonian Biobank, and the THL Biobank. The UK Biobank served as the primary dataset for discovery analyses, while the THL Biobank was used for replication in Publications III and IV, and the Estonian Biobank for replication in Publication IV. These cohorts comprise adults from European countries, each with distinct recruitment methods, blood sampling protocols, timeframes, age distributions, and approaches to capturing outcomes via electronic health records.
Ethical approval was granted by local ethics committees, and all participants provided written informed consent. Details of these datasets are provided in Publications II-IV and summarized below.

4.1.1 UK Biobank

The UK Biobank is an internationally accessible biomedical database consisting of half a million participants aged 40–69 at the time of enrollment. Recruitment, which took place between 2006 and 2010, was conducted on a voluntary basis across 22 assessment centers located in Scotland, England, and Wales. The study continuously collects follow-up data through electronic health records, covering hospital admissions, primary care visits, and mortality records, which are regularly updated. The UK Biobank study received ethical approval from the North West Multi-Centre Research Ethics Committee, and all participants provided written informed consent. The biomarker profiling of plasma samples using NMR spectroscopy was approved, and the data were accessed, under UK Biobank project ID 30418.

4.1.2 THL Biobank

The THL Biobank data used in Publications III–IV comprise five population cohorts (National FINRISK Studies from 1997, 2002, 2007, and 2012, along with the Health 2000 Survey), totaling approximately 35,000 individuals. Each of these cohorts represents a distinct random sample of individuals aged 25–98 across Finland. Disease outcome data are linked from national hospital registries and reimbursement records, with follow-up data available until 2017. The THL Biobank cohorts were approved by the Coordinating Ethical Committee of the Helsinki and Uusimaa Hospital District, Finland, and the data were accessed under research application number BB2016_86.

4.1.3 Estonian Biobank

The Estonian Biobank (EBB) comprises around 210,000 individuals, representing approximately 20% of the adult Estonian population.
The recruitment was conducted between 2002 and 2022 on a voluntary basis. The EBB database is routinely updated through integration with multiple national registries, hospital databases, and the national health insurance fund’s database, which contains detailed records of treatments and service billing information. The study was approved by the Estonian Committee on Bioethics and Human Research, and data access was granted under research approval number 1.1-12/2770.

4.1.4 NMR metabolomic biomarker profiling

The metabolomic biomarker data in these cohorts were generated by the NMR platform from Nightingale Health. The platform quantifies 249 metabolomic measures from a blood sample in one experimental assay. The biomarker panel is designed for accurate quantification in a high-throughput manner, and thus primarily covers molecules with high abundance in the blood. A majority of the biomarkers reflect lipoprotein metabolism, measuring the lipid concentrations and composition in 14 lipoprotein subclasses. These include total cholesterol, free cholesterol, cholesterol esters, triglycerides, phospholipids, and total lipid concentration within each subclass. The panel also covers the absolute concentrations and ratios of abundant fatty acids, glycolysis-related metabolites, and small molecules, such as ketone bodies and amino acids. Apolipoproteins A1 and B, as well as two inflammatory protein measures, glycoprotein acetyls and albumin, are also quantified due to their high concentration in blood. The measurements of 37 of these biomarkers were certified for clinical use in Europe at the time of Publications II-IV. The metabolomic biomarker data have been made available through the UK Biobank resource, with the measurements carried out and data released in three phases.
4.2 Predictive modelling of severe infectious disease and COVID-19 risk using metabolomic biomarkers (Publication II)

The coronavirus disease 2019 (COVID-19) pandemic posed a global health threat, affecting healthcare systems and societies worldwide. Protecting the individuals most vulnerable to severe and potentially fatal outcomes was central to public health policies, with more stringent social distancing and other preventive measures recommended for high-risk groups. While older age and chronic health conditions were recognized as major risk factors [60,61], the variation in individual susceptibility was not fully understood. Insights into molecular factors predisposing individuals to severe infectious diseases could help elucidate why certain groups were at an increased risk.

The first wave of COVID-19 swept through Europe in the spring of 2020, coinciding with the completion of the first tranche of metabolomic biomarker measurements from the UK Biobank. This motivated our investigation in Publication II into whether NMR metabolomic biomarkers are associated with severe outcomes in infectious diseases, including COVID-19. Given the limited availability of early COVID-19 data within the UK Biobank, we used hospitalization for pneumonia as a proxy outcome to derive a risk prediction score, which was subsequently tested using COVID-19 cases.

4.2.1 Study setting and methodology

Using data from the UK Biobank, we analyzed metabolomic biomarkers from plasma samples collected between 2006 and 2010, approximately a decade prior to the onset of the COVID-19 pandemic. Among 92,725 individuals with available COVID-19 data linkage at the time, there were 652 PCR-confirmed positive cases from hospitalized individuals, which were considered severe cases in this study. As the initial data on COVID-19 were limited, we also included analyses of severe pneumonia outcomes.
Among 105,146 participants with complete metabolomic data and no previous diagnosis of pneumonia, 2,507 pneumonia cases were recorded in hospital or death registries and were considered severe cases. Since all COVID-19 cases occurred nearly a decade after the blood sampling at the beginning of the study, time-resolved outcomes for COVID-19 were not available. Therefore, to ensure methodological consistency between the pneumonia and COVID-19 analyses, we employed logistic regression for biomarker association testing and for developing a multi-biomarker risk prediction score. All statistical models were adjusted for age, sex, and UK Biobank assessment center.

4.2.2 Associations of individual metabolomic biomarkers with severe pneumonia and COVID-19

Analysis of the individual biomarkers revealed several significant associations with severe pneumonia. For example, higher levels of omega-6 and omega-3 fatty acids, cholesterol, branched-chain amino acids, histidine, and albumin were associated with reduced risk (Figure 4.1). In contrast, elevated levels of saturated and monounsaturated fatty acids, as well as glycoprotein acetyls (GlycA), a marker of low-grade inflammation, were associated with increased risk. A notably similar pattern of associations was observed for severe COVID-19. Higher levels of omega-3 and omega-6 fatty acids and albumin were associated with lower risk, while elevated GlycA concentrations were associated with increased risk (Figure 4.2). In addition to the highly similar overall association pattern between COVID-19 and pneumonia, novel findings include the association of lower concentrations of certain amino acids and fatty acids with increased risk of both severe pneumonia and severe COVID-19.
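The odds ratios discussed here are expressed per 1-SD increment in biomarker concentration. As an illustrative sketch only, and not the code used in Publication II, the snippet below shows how a biomarker can be scaled to SD units and how a logistic regression coefficient on that scale converts into an odds ratio with a Wald-type 95% confidence interval. The function names and the example coefficient and standard error are hypothetical, and the covariate adjustment (age, sex, assessment center) is assumed to have been handled in the model that produced the coefficient.

```python
import math


def standardize(values):
    """Scale raw biomarker concentrations to zero mean and unit (sample) SD."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    sd = math.sqrt(variance)
    return [(v - mean) / sd for v in values]


def odds_ratio_with_ci(beta, se, z=1.96):
    """Convert a log-odds coefficient and its standard error into
    (odds ratio, lower CI bound, upper CI bound) using a Wald interval."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


# Hypothetical inputs: a coefficient per 1-SD increment and its standard error.
beta_per_sd, standard_error = 0.5, 0.1
or_, lower, upper = odds_ratio_with_ci(beta_per_sd, standard_error)
print(f"OR per 1-SD: {or_:.2f} (95% CI {lower:.2f}-{upper:.2f})")
```

Expressing every biomarker in SD units puts odds ratios for measures with very different concentration ranges on a common scale, which is what allows them to be compared side by side in a single forest plot.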
Collectively, these findings suggest that alterations in metabolomic biomarkers may reflect an underlying susceptibility to severe outcomes from COVID-19 and pneumonia in the general population. However, the observational nature of this study limits our ability to establish whether these biomarkers causally contribute to disease risk or simply reflect broader underlying risk factors. The observed biomarker associations are similar to those previously reported for all-cause mortality [30], indicating that these metabolomic profiles may reflect broader frailty rather than being specific to infectious diseases.

4.2.3 Multi-biomarker score stratifies the risk of severe infectious diseases

Given that the metabolomic biomarkers are measured simultaneously in a single NMR measurement, we sought to determine whether combining these biomarkers into a single risk score could capture the risk even more strongly. In light of the consistent association patterns between severe pneumonia and COVID-19, as well as their previously reported shared risk factors [222], we developed a risk prediction score for severe pneumonia and evaluated its performance in predicting both pneumonia and COVID-19. Using 50% of the study population for model training, we applied logistic regression with LASSO regularization, employing 5-fold cross-validation to tune the regularization parameter. The resulting risk score, based on a weighted sum of 25 biomarkers, was termed the "infectious disease score" and subsequently tested in the remaining 50% of the study population.

[Figure 4.1: forest plot. Panels: Amino acids, Lipoprotein lipids, Apolipoproteins, Glycolysis metabolites, Fluid balance, Fatty acid ratios, Fatty acids, Inflammation, Multibiomarker score. x-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in biomarker concentration or in infectious disease score. Markers distinguish p value < 0.001 from p value ≥ 0.001.]

Figure 4.1. Associations of metabolomic biomarkers and the multi-biomarker infectious disease score with future risk of severe pneumonia, in terms of odds ratios (N = 105,146; 2,507 events). Associations are shown for 37 clinically validated biomarkers measured by the NMR platform. Odds ratios represent the change in risk per 1-SD increase in biomarker levels. Models are adjusted for age, sex, and UK Biobank assessment center. Horizontal bars represent 95% confidence intervals. Abbreviations: PUFA, polyunsaturated fatty acids; MUFA, monounsaturated fatty acids; SFA, saturated fatty acids; DHA, docosahexaenoic acid; BCAA, branched-chain amino acids. Reprinted and modified from Publication II.

The multi-biomarker infectious disease score demonstrated stronger associations with the severe infectious disease outcomes than any individual biomarker. For severe pneumonia, the score had an odds ratio (OR) of 1.7 (95% CI 1.6–1.8) per one standard deviation (SD) increment (Figure 4.1).
Individuals in the highest quintile of the score had nearly four times the risk of developing severe pneumonia compared to those in the lowest quintile (OR 3.8, 95% CI 3.0–4.7).

[Figure 4.2: forest plot. Panels: Amino acids, Lipoprotein lipids, Apolipoproteins, Glycolysis metabolites, Fluid balance, Inflammation, Fatty acid ratios, Fatty acids, Multibiomarker score. x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in biomarker concentration or in infectious disease score. Markers distinguish p value < 0.001 from p value ≥ 0.001.]

Figure 4.2. Associations of metabolomic biomarkers and the multi-biomarker infectious disease score with future risk of severe COVID-19, in terms of odds ratios (N = 92,725; 652 events). Associations are shown for 37 clinically validated biomarkers measured by the NMR platform. Odds ratios represent the change in risk per 1-SD increase in biomarker levels. Models are adjusted for age, sex, and UK Biobank assessment center. Horizontal bars represent 95% confidence intervals. Abbreviations: PUFA, polyunsaturated fatty acids; MUFA, monounsaturated fatty acids; SFA, saturated fatty acids; DHA, docosahexaenoic acid; BCAA, branched-chain amino acids. Reprinted and modified from Publication II.
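To make the quintile comparison concrete, the following sketch shows one way a weighted-sum score could be formed, individuals assigned to score quintiles, and the highest quintile contrasted with the lowest. It is a simplified illustration under stated assumptions: the weights and counts are hypothetical, the function names are ours, and the odds ratio is a crude (unadjusted) 2x2 estimate with a Woolf-type confidence interval, whereas the analyses in Publication II used logistic regression adjusted for age, sex, and assessment center.

```python
import math


def risk_score(biomarkers, weights):
    """Weighted sum of biomarker values; the weights here are hypothetical."""
    return sum(weights[name] * biomarkers[name] for name in weights)


def quintile_labels(scores):
    """Assign each score a quintile label 1..5 by rank (ties broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // len(scores) + 1
    return labels


def crude_odds_ratio(cases_hi, noncases_hi, cases_lo, noncases_lo, z=1.96):
    """Unadjusted odds ratio (highest vs. lowest quintile) with a Woolf 95% CI."""
    or_ = (cases_hi * noncases_lo) / (noncases_hi * cases_lo)
    se = math.sqrt(1 / cases_hi + 1 / noncases_hi + 1 / cases_lo + 1 / noncases_lo)
    return or_, math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)


# Hypothetical example: two standardized biomarkers and made-up weights.
person = {"GlycA": 1.2, "Albumin": -0.4}
weights = {"GlycA": 0.3, "Albumin": -0.2}
print("score:", risk_score(person, weights))

# Hypothetical counts of cases and non-cases in the top and bottom quintiles.
print("OR (95% CI):", crude_odds_ratio(30, 70, 10, 90))
```

Comparing extreme quintiles complements the per-SD odds ratio: it describes the contrast between the two ends of the score distribution rather than the average change per unit of the score.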
The infectious disease score was also significantly associated with severe COVID-19 outcomes, with an OR of 1.4 (95% CI 1.3–1.5) per SD increment (Figure 4.2). Individuals in the highest quintile had almost three times the risk of severe COVID-19 compared to those in the lowest (OR 2.9, 95% CI 2.1–3.8), despite the decade-long time lag from blood sampling until COVID-19. Notably, when the analysis of severe pneumonia was restricted to events occurring 7–11 years after blood sampling, to mirror the time gap to the COVID-19 pandemic, the association magnitude for severe pneumonia (OR 2.6, 95% CI 1.7–3.9; highest vs. lowest quintile) was comparable to that of severe COVID-19 (Figure 4.3).

[Figure 4.3: forest plots. Panel a, "Mimicking the decade lag to the COVID-19 pandemic": within 7 years from blood sampling (943 events) and 7–11 years from blood sampling (307 events). Panel b, "Mimicking a preventative screening scenario carried out today": within 2 years from blood sampling (162 events) and 2–11 years from blood sampling (1088 events). x-axis: odds ratio for severe pneumonia (95% CI), highest vs. lowest quintile of infectious disease score.]

Figure 4.3. Association of the infectious disease score with long-term and short-term risk for severe pneumonia in the UK Biobank test set. a) Odds ratios for severe pneumonia events that occurred within the first 7 years after blood sampling, compared to those occurring 7–11 years after blood sampling. b) Odds ratios for severe pneumonia events that occurred within the first two years after blood sampling and those occurring after the first two years. Models are adjusted for age, sex, and assessment centre. Odds ratios are presented as comparisons between individuals in the highest and lowest quintiles of the infectious disease score. Reprinted and modified from Publication II.

To screen effectively for susceptibility to severe COVID-19, however, a strong association with short-term risk would be required, and such data were not available for COVID-19 in the UK Biobank. When the analysis of severe pneumonia was limited to events that occurred within the first two years after blood sampling, the short-term risk increase was considerably stronger. In this 2-year follow-up analysis, individuals in the highest quintile of the infectious disease score were over seven times more likely to develop severe pneumonia compared to those in the lowest quintile (OR 7.9, 95% CI 4.1–15.6). If a similar short-term risk increase applied to COVID-19, this score might prove useful in identifying high-risk individuals for severe COVID-19. However, the lack of metabolomic biomarker data from blood samples taken shortly before the pandemic prevented us from directly assessing short-term COVID-19 susceptibility.

4.2.4 Conclusions

In conclusion, this study demonstrated that a signature of metabolomic blood biomarkers, collected a decade prior to the COVID-19 pandemic, is associated with increased susceptibility to severe pneumonia and COVID-19. This was the first study to show that many of these biomarkers, previously linked primarily to cardiometabolic diseases, were also associated with the risk of severe infectious disease outcomes. These findings suggest that metabolomic biomarkers should not be viewed solely as biomarkers of cardiometabolic risk but may also reflect the susceptibility to infectious diseases and potentially other conditions. The multi-biomarker infectious disease score developed in this study was particularly predictive of severe pneumonia.
Notably, the associations observed with the multi-biomarker score were stronger than those reported for many pre-existing health conditions, such as diabetes and obesity [222]. Individuals in the highest quintile of the score had an over sevenfold increase in short-term pneumonia risk compared to those in the lowest quintile, providing a strong basis for identifying individuals at high risk. If a similar short-term risk elevation had applied to COVID-19, metabolomic biomarker profiling could have complemented existing methods for identifying individuals at high risk of severe COVID-19 as well. However, the introduction of COVID-19 vaccines rapidly changed the public health needs at the time these findings were published. Regardless of the translational applications, these results offer new insights into how metabolomic biomarkers associate with susceptibility to severe COVID-19 and other infectious disease outcomes.

4.3 Systematic characterization of the associations of metabolomic biomarkers across common diseases (Publication III)

In Publication III, the aim was to catalogue the associations of NMR metabolomic biomarkers with the risk of a wide range of common diseases. This extends beyond the infectious diseases covered in Publication II and the cardiometabolic focus that has previously prevailed in the research on these biomarkers. Leveraging the uniquely large sample size and extensive health outcome data available in the UK Biobank, this study sought to elucidate the broader associations of these biomarkers with disease susceptibility, covering over 700 diseases. Prior research had characterized the associations of the inflammatory biomarker GlycA, one of the most prominent biomarkers measured by NMR, with over 400 common diseases in a cohort of 11,861 individuals [144]. The present study, with a sample size ten times larger, is the first to systematically evaluate all biomarkers measured by the NMR platform and across many new disease endpoints.
4.3.1 Study setting and methodology

We systematically analyzed the associations of 249 NMR biomarkers across 717 incident diseases, 648 prevalent diseases, and 77 causes of death. Disease endpoints were defined using three-character ICD-10 codes (International Classification of Diseases, 10th Revision), derived from hospital episode statistics and death records. The analysis included all diseases with at least 50 occurrences within 10 years following blood sampling (for incident and mortality outcomes) or those documented in prevalent records up to 25 years prior to sampling. Association testing for incident and mortality outcomes was conducted using Cox proportional hazards regression, while logistic regression was applied for prevalent disease outcomes. All models were adjusted for age, sex, and UK Biobank assessment center, with age used as the time scale in the Cox regression models.

4.3.2 Biomarker associations across a broad range of diseases

Among 717 incident disease outcomes, a total of 33,764 significant biomarker associations were observed at a multiple-testing-corrected threshold of p < 5e-5. Similarly, 26,035 significant associations were observed for 648 prevalent diseases, and 3,055 associations for 77 causes of death. Notably, these associations extended beyond cardiometabolic diseases, spanning nearly all ICD-10 chapters and highlighting metabolomic biomarkers as risk markers for a broad spectrum of conditions, including cancers, mental health outcomes, and musculoskeletal disorders (Figure 4.4a).

Examining the biomarker associations revealed both broad and disease-specific patterns. For example, GlycA was significantly associated with 32% of the disease endpoints studied, showing a median hazard ratio of 1.26 per 1-SD increment. These associations included conditions such as gout, type 2 diabetes, kidney diseases, and myocardial infarction (Figure 4.4b).
Similarly, the ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA) showed widespread associations across various diseases (Figure 4.4c). In contrast, certain biomarkers had more disease-specific patterns, such as alanine, which was primarily linked to diabetes and its complications (Figure 4.4d). Total branched-chain amino acids exhibited divergent associations, being positively linked to metabolic diseases but inversely associated with lung diseases and smoking-related conditions (Figure 4.4e). A later study was dedicated to exploring these opposing effects of branched-chain amino acids on health and disease, including causal assessments [223].

These findings were further replicated in a meta-analysis of five independent population-based cohorts from Finland, all measured using the same NMR platform. The replication analysis confirmed consistent associations, particularly for amino acids, polar metabolites, and fatty acid ratios. Although some deviations were observed in absolute fatty acid measures and LDL-related biomarkers, the overall concordance between the UK Biobank and the Finnish cohorts demonstrates the robustness and transferability of the findings.

Figure 4.4. Biomarker associations for incident disease outcomes across a range of diseases. a) Total number of significant associations for each biomarker at a statistical significance threshold of p < 5e-5. The colour coding denotes the proportion of significant associations by ICD-10 chapters. b–e) Twenty most significant associations for four selected biomarkers: b) glycoprotein acetyls (GlycA), c) ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA), d) alanine, and e) branched-chain amino acids (BCAA). The associations are ranked by descending absolute effect size, with each association represented by hazard ratios (HRs) and 95% confidence intervals (CI) per standard deviation increase in biomarker concentration. All models are adjusted for sex and UK Biobank assessment centre, using age as the time scale in Cox regression. Reprinted from Publication III.

4.3.3 Insights into shared biomarker signatures

Through systematic characterization of biomarker associations across diseases, we obtained “biomarker signatures” that capture the unique patterns of biomarker associations for each disease. We then performed clustering analyses of these biomarker signatures to gain insights into differences and similarities between the metabolomic association profiles of diseases. For instance, type 2 diabetes exhibited highly similar biomarker patterns of association with several of its complications, such as retinal disorders and polyneuropathies. Likewise, diseases like pneumonia and bacterial infections, as well as chronic obstructive pulmonary disease (COPD) and lung cancer, clustered together.

Figure 4.5. Example comparison of the biomarker signatures for the incidence of acute myocardial infarction (ICD-10 code I21) and heart failure (ICD-10 code I50), here illustrated for 37 biomarkers from the NMR platform certified for clinical use. Hazard ratios (HRs) for the biomarkers are represented as individual points, with corresponding 95% confidence intervals (CI) displayed through vertical and horizontal error bars. Reprinted from Publication III.

However, differences were also observed between seemingly related conditions. For instance, different types of heart disease, such as acute myocardial infarction and heart failure, which are often grouped together in composite endpoints for risk prediction, had notably different biomarker associations (Figure 4.5). Further analysis revealed that different subtypes of cardiovascular disease, including various types of chronic ischaemic heart disease, myocardial infarction, and different types of stroke, had distinct association patterns. Interestingly, the biomarkers often had stronger associations with other circulatory diseases than with ischemic stroke and myocardial infarction, suggesting that separating these endpoints could potentially improve risk prediction.
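One simple way to quantify how alike two biomarker signatures are is to correlate their log hazard ratios across the shared biomarkers. The sketch below illustrates this idea only; the similarity measure used for the clustering in Publication III is not specified here, so the choice of Pearson correlation, the function names, and the example hazard ratios are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def signature_similarity(hr_a, hr_b):
    """Correlate the log hazard ratios of two diseases over shared biomarkers.

    hr_a and hr_b map biomarker name -> hazard ratio per 1-SD increment.
    """
    shared = sorted(set(hr_a) & set(hr_b))
    log_a = [math.log(hr_a[name]) for name in shared]
    log_b = [math.log(hr_b[name]) for name in shared]
    return pearson(log_a, log_b)


# Hypothetical hazard ratios for two diseases (illustrative values only).
disease_1 = {"GlycA": 1.30, "Albumin": 0.85, "Omega-3 %": 0.90, "Alanine": 1.05}
disease_2 = {"GlycA": 1.20, "Albumin": 0.90, "Omega-3 %": 0.95, "Alanine": 1.25}
print("signature similarity:", round(signature_similarity(disease_1, disease_2), 3))
```

Working on the log scale treats a hazard ratio of 0.5 and one of 2.0 as effects of equal magnitude in opposite directions, so protective and harmful associations contribute symmetrically to the similarity.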
4.3.4 Accounting for the effects of lipid-lowering medications

To account for the potential impact of lipid-lowering medications, we conducted additional analyses excluding individuals using these treatments. However, removing these participants could introduce selection bias by disproportionately excluding individuals with specific health profiles. To address this limitation, we replicated the analyses in the FINRISK 1997 cohort, where the prevalence of cholesterol-lowering medication use was much lower because sampling occurred before the widespread adoption of statins for primary prevention. Most biomarker associations remained consistent across the two cohorts, including the lack of association between LDL cholesterol and major adverse cardiovascular events (MACE). The inverse associations of LDL cholesterol with chronic kidney failure and all-cause mortality were also replicated.

To further evaluate the effects of lipid-lowering treatments, we stratified the analyses by age tertiles. Because the use of cholesterol-lowering and other medications increases with age, younger age groups are less likely to be influenced by these sources of bias. The age-stratified analysis revealed stronger biomarker associations in the younger age tertiles for several biomarkers. Notably, LDL-related biomarkers showed weaker associations in older individuals, with some associations even reversing direction in non-circulatory diseases, likely due to the higher prevalence of statin use in older populations. Inflammatory biomarkers and several amino acids, which are not affected by lipid-lowering treatments, also displayed stronger associations in younger participants. This suggests that the observed age-related differences are not solely attributable to statin use but may reflect broader age-related changes in biomarker-disease associations.
4.3.5 Conclusions

This study highlighted widespread associations of metabolomic biomarkers with a broad range of diseases, indicating that the metabolomic biomarkers are risk markers beyond cardiometabolic diseases. These findings reinforce earlier reports linking metabolomic biomarkers to all-cause mortality and multimorbidity [29,30], as many biomarkers were found to be associated with leading causes of mortality and morbidity. While inflammatory biomarkers like GlycA have been previously linked to various diseases [143,144], this study for the first time extends these findings to circulating fatty acids, amino acids, and detailed lipoprotein measures. The widespread associations of biomarkers like GlycA and MUFA% demonstrate their potential to inform on the risk of multi-disease outcomes.

The results from this study have been compiled into a publicly available biomarker-disease atlas, an interactive web tool that allows users to visualize association results and download summary statistics. We anticipate that this resource will serve as a valuable foundation for further research, facilitating deeper investigations beyond biomarker discovery.

4.4 Metabolomic and genomic prediction of common diseases (Publication IV)

Building upon the strong associations between metabolomic biomarkers and various diseases established in Publication III, along with the promising multi-biomarker prediction results from Publication II, Publication IV aimed to evaluate the predictive performance of metabolomic multi-biomarker risk scores for leading chronic diseases. A primary novelty of this study was the assessment of the transferability of these risk scores across multiple large population cohorts. Additionally, the study compared the predictive power of metabolomic risk scores against other data sources, such as polygenic risk scores and established clinical risk factors.
Furthermore, the study included a subset of individuals with biomarker measurements taken at two time points to assess whether changes in biomarker profiles over time could alter future disease risks. This research was a collaborative effort involving many scientists and extensive work in creating and curating the data resources. However, in this section, I will focus on my primary contributions, which involved the development of risk prediction models and the evaluation of their performance and calibration across multiple biobanks.

4.4.1 Study setting and methodology

This study leveraged NMR metabolomic biomarker profiles from three major population biobanks: the UK Biobank, Estonian Biobank, and Finnish THL Biobank, comprising 700,217 individuals with extensive health outcome data. We focused on 12 leading causes of morbidity in the WHO European region, which together account for over one-third of total disability-adjusted life years (DALYs) in this population. Disease outcomes were defined as the four-year incidence of these conditions, accommodating the shorter follow-up period available in the Estonian Biobank.

To derive metabolomic risk scores for predicting disease incidence, regularized Cox proportional hazards regression was used to train the models in one half of the UK Biobank population. Age and sex were included as fixed covariates, and LASSO regularization was applied for variable selection among the 36 metabolomic biomarkers certified for clinical use, using 5-fold cross-validation to tune the regularization parameters. The performance of these metabolomic risk scores was then tested in the remaining half of the UK Biobank and externally validated using data from the Estonian and THL biobanks. Importantly, since the biomarkers are measured in absolute concentration units (e.g. g/L or mmol/L), the metabolomic risk scores derived from the UK Biobank could be directly applied to the other cohorts without requiring cohort-specific normalization of the biomarker values. This is a notable distinction from typical ’omics analyses, where such normalization is often necessary to account for varying measurement scales, platforms, and batch effects [224,225].

4.4.2 Metabolomic risk scores stratify disease risk

The metabolomic risk scores effectively stratified disease risk across all 12 conditions, demonstrating a clear increase in event rates with higher risk score percentiles (Figure 4.6). The risk increase was particularly steep in the upper tail of the distribution, with the effect being most pronounced for type 2 diabetes, alcoholic liver disease, and liver fibrosis and cirrhosis. In terms of discrimination, the majority of the metabolomic risk scores demonstrated good performance across the studied diseases, with area under the ROC curve (AUC) ranging from 0.70 to 0.95, except for depression, which had a lower AUC.

The models were further evaluated to illustrate their utility in a practical risk prediction scenario. Specifically, we established a threshold corresponding to the top decile of risk scores in the UK Biobank training set. Individuals in other cohorts who exceeded this threshold were designated as the high-risk group and compared against the remaining population. Across all three biobanks, this high-risk group consistently exhibited elevated disease risks, as indicated by hazard ratios (Figure 4.7). Exceptions included alcoholic liver disease and depression, which demonstrated statistically significant heterogeneity in the meta-analysis (Cochran’s Q-test, multiple-testing-corrected p < 0.004).
The meta-analysis comparing this high-risk group to the remaining population revealed hazard ratios of approximately 10 for liver disease and diabetes, around 4 for lung cancer and chronic obstructive pulmonary disease (COPD), and around 2.5 for myocardial infarction, stroke, and vascular dementia. Notably, the UK Biobank showed the highest effect sizes for only 4 out of the 12 diseases, indicating that the metabolomic risk scores and their associations are generalizable across study populations rather than being overfitted to specific characteristics of the UK Biobank training data.
Figure 4.6. Observed incidence of the 12 diseases across one-percent bins of the metabolomic risk score, calculated as a sample size-weighted mean of the 4-year incidence across the three biobank cohorts (N = 481,678). The red shaded area represents the top 10% of age and sex adjusted metabolomic risk score. The horizontal dashed line indicates overall population incidence. Reprinted from Publication IV. [Figure not reproduced. Panels: Myocardial infarction; Ischemic stroke; Intracerebral hemorrhage; Lung cancer; Type 2 diabetes; Chronic obstructive pulmonary disease; Alzheimer's disease; Vascular and other dementia; Depressive disorders; Alcoholic liver disease; Cirrhosis of the liver; Colon and rectum cancers. x-axis: Percentile of metabolomic score; y-axis: Incidence (%).]
Translating risk prediction models from biobank research into clinical practice requires both robust discrimination and calibration. Calibration was assessed by comparing observed and predicted event rates across deciles within each biobank and computing the corresponding calibration slopes. Generally, the metabolomic risk scores showed good calibration. In the UK Biobank test set, calibration slopes ranged from 0.95 to 1.24 across diseases, which is expected since the models were trained using data from the same biobank. In the Estonian Biobank, calibration slopes ranged from 0.76 to 1.16, except for depression, which exhibited a lower slope of 0.42, likely reflecting differences in diagnostic criteria and recording practices between countries. In the THL Biobank, calibration slopes ranged from 1.03 to 1.21. These results suggest that the metabolomic risk scores achieve a reasonable degree of calibration across diverse population cohorts, with some variability influenced by cohort-specific factors. Differences in participant recruitment strategies, such as the volunteer-based enrollment in the UK Biobank and Estonian Biobank, may contribute to these inconsistencies. Nonetheless, this level of calibration is comparable to widely used clinical tools. For instance, the Pooled Cohort Equations for cardiovascular risk assessments have shown similar calibration when applied across different populations, such as between the US and Canada 226 .
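As an illustration of the decile-based calibration assessment described above, the following sketch bins individuals by predicted risk, compares mean predicted and observed event rates per decile, and fits a least-squares line through the ten points. This is one plausible reading of a calibration slope; the exact estimator used in Publication IV may differ, censoring is ignored, and all names and data are hypothetical.

```python
import numpy as np

def decile_calibration(pred_risk, event):
    """Mean predicted risk and observed event rate per decile, plus fitted line.

    pred_risk: predicted probabilities of an event within the horizon.
    event: binary indicator of whether the event occurred (censoring ignored).
    """
    pred_risk = np.asarray(pred_risk, dtype=float)
    event = np.asarray(event, dtype=float)
    order = np.argsort(pred_risk)          # sort individuals by predicted risk
    bins = np.array_split(order, 10)       # ten (near-)equal-sized groups
    mean_pred = np.array([pred_risk[b].mean() for b in bins])
    obs_rate = np.array([event[b].mean() for b in bins])
    # Regress observed on predicted decile rates; slope near 1 means good calibration.
    slope, intercept = np.polyfit(mean_pred, obs_rate, 1)
    return mean_pred, obs_rate, float(slope), float(intercept)

# Synthetic, perfectly calibrated example (hypothetical data).
rng = np.random.default_rng(1)
pred = rng.uniform(0.01, 0.30, size=200_000)
events = rng.random(200_000) < pred
mean_pred, obs_rate, slope, intercept = decile_calibration(pred, events)
```

Under this reading, a slope below 1 indicates that predicted risks are spread more widely than observed rates, and a slope above 1 indicates the opposite.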
Similarly, the QRISK3 cardiovascular risk model, widely implemented in the UK, has demonstrated comparable calibration performance when applied to external cohorts like the UK Biobank 227 .
Figure 4.7. Hazard ratios for metabolomic risk scores across the 12 studied diseases, comparing individuals in the highest risk decile to the rest of the population. Results are presented for three biobanks: UK Biobank (purple), THL Biobank (teal), and Estonian Biobank (orange) (N = 481,678). Meta-analysis of these results from the three biobanks is displayed in black. Filled circles indicate statistically significant associations (p<0.004), while hollow circles represent non-significant associations. Horizontal error bars show the 95% confidence intervals (CI). Reprinted from Publication IV. [Figure not reproduced. Rows: Myocardial infarction; Ischemic stroke; Intracerebral hemorrhage; Lung cancer; Type 2 diabetes; Chronic obstructive pulmonary disease; Alzheimer's disease; Vascular and other dementia; Depressive disorders; Alcoholic liver disease; Cirrhosis of the liver; Colon and rectum cancers. x-axis: Hazard ratio (95% CI), highest risk decile vs. remaining population.]
4.4.3 Conclusions
This study highlights the potential of NMR metabolomic biomarkers in stratifying the risk for multiple chronic diseases. By integrating data from three large population biobanks, comprising over 700,000 individuals, we developed and validated metabolomic risk scores for 12 major diseases that contribute substantially to the burden of chronic diseases. Notably, the ability to apply these risk scores across different biobanks without cohort-specific normalization demonstrates robustness and transferability of the models.
The consistent calibration and predictive performance observed across cohorts, despite differences in age ranges, fasting protocols and disease prevalence, indicate the potential utility of these risk scores in diverse real-world settings. These findings further position metabolomic biomarkers as promising tools for comprehensive multi-disease risk assessment.
5. Machine learning with comprehensive interaction modelling for disease risk prediction (Publication V)
In Publication V, we aimed to address the challenge of estimating comprehensive interaction effects among predictors in time-to-event prediction models. We introduced survivalFM, a novel machine learning extension to Cox proportional hazards regression. Unlike standard Cox regression, which assumes linear effects and requires pre-specified interaction terms, survivalFM automatically estimates all potential interaction effects among predictor variables. This method builds on the concept of factorizing the interaction parameters from factorization machines (FM). This concept was successfully applied in Publication I for predicting drug combination responses and is here applied for the first time in the context of survival analysis. While survivalFM can be used for modelling any time-to-event outcome, in Publication V, we highlight its applicability in disease risk prediction, using data from the UK Biobank.
5.1 Foundations of survivalFM
survivalFM is an extension of the widely used Cox proportional hazards model 32 , which relates time-to-event outcomes to a set of predictor variables through a hazard function defined as:
$$h(t \mid x) = h_0(t) \exp(f(x)) \tag{5.1}$$
where $h_0(t)$ is the baseline hazard and $\exp(f(x))$ is the partial hazard. In the standard formulation, the partial hazard is parameterized by a linear combination of predictor variables, $f(x) = \beta^\top x$.
survivalFM extends this formulation by incorporating an estimation of all pairwise interaction effects through a factorized parametrization (Figure 5.1a-b):
$$f(x) = \beta^\top x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle \, x_i x_j \tag{5.2}$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product, and $d$ represents the number of predictor variables. The first part captures the linear effects of the predictors, similar to the standard Cox regression model. The second part captures the pairwise interaction effects between all predictors $x_i$ and $x_j$. Rather than directly estimating the interaction terms $\beta_{ij}$, the factorized parametrization approximates these effects using an inner product of two low-rank latent vectors, $\tilde{\beta}_{ij} = \langle p_i, p_j \rangle$. This approach substantially reduces the number of parameters to be estimated, as the rank $k$ of the factorization is typically much lower than the total number of predictors ($k \ll d$). Further details of this factorization approach are described in Section 2.5.2. This method avoids the statistical and computational challenges associated with directly estimating interaction terms when numerous predictor variables are present. Coupled with an efficient quasi-Newton optimization algorithm (BFGS; Broyden–Fletcher–Goldfarb–Shanno 228–231 ), this method facilitates comprehensive modeling of interaction effects even with many predictor variables. Notably, unlike many other advanced machine learning methods for survival analysis, this approach preserves the interpretability of the underlying model through access to the estimated effects of the individual predictors and their interactions.
5.2 Study population and evaluation settings
To analyze if survivalFM could enhance disease risk prediction performance and offer insights into risk factor interactions, we conducted analyses using data from the UK Biobank.
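As an illustration of the factorized parametrization in Eq. 5.2 (Section 5.1), the sketch below evaluates the predictor function without forming the full interaction matrix, using the standard factorization machine identity, and checks it against a direct double loop. It reads the sum literally over all ordered pairs with i ≠ j; a convention that counts each unordered pair once would halve the interaction term. Function names, dimensions, and data are hypothetical, and this is not the survivalFM implementation.

```python
import numpy as np

def fm_predictor(X, beta, P):
    """f(x) = beta^T x + sum_{i != j} <p_i, p_j> x_i x_j for each row of X.

    X: (n, d) predictors, beta: (d,) linear effects,
    P: (d, k) factor matrix whose rows are the latent vectors p_i.
    """
    linear = X @ beta
    XP = X @ P                                     # (n, k): sum_i x_i p_i
    all_pairs = np.sum(XP ** 2, axis=1)            # sum over all i, j (incl. i == j)
    diagonal = (X ** 2) @ np.sum(P ** 2, axis=1)   # sum_i ||p_i||^2 x_i^2
    return linear + all_pairs - diagonal

def fm_predictor_naive(X, beta, P):
    """Same quantity via the explicit d x d interaction matrix (for checking)."""
    B = P @ P.T                                    # B[i, j] = <p_i, p_j>
    out = X @ beta
    d = X.shape[1]
    for i in range(d):
        for j in range(d):
            if i != j:
                out = out + B[i, j] * X[:, i] * X[:, j]
    return out

rng = np.random.default_rng(2)
n, d, k = 20, 50, 5                                # hypothetical sizes with k << d
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
P = rng.normal(scale=0.1, size=(d, k))

fast = fm_predictor(X, beta, P)
naive = fm_predictor_naive(X, beta, P)

n_direct = d * (d - 1) // 2                        # separate pairwise coefficients
n_factorized = d * k                               # entries of P
```

With these illustrative sizes, direct estimation would require 1225 pairwise coefficients, whereas the factor matrix has 250 entries, and the factorized evaluation costs on the order of n·d·k operations rather than n·d².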
We focused on the 10-year incidence of ten selected diseases as our outcomes of interest. To evaluate model performance across varying data sources, we designed four prediction scenarios incorporating a range of predictors, from standard clinical variables to ’omics-based data sources, including biochemistry measures, blood counts, metabolomic biomarkers, and polygenic risk scores. To determine the benefits of survivalFM in capturing complex interaction effects, we compared its performance to that of standard Cox proportional hazards regression (Figure 5.1b), applying L2 regularization in both methods to manage complexity and mitigate overfitting. Model performance was assessed using 10-fold cross-validation, with a 20% validation set in each fold used to optimize the regularization parameters.
Figure 5.1. Overview of the method. a) A machine learning method for survival analysis, survivalFM, is designed to estimate linear and all pairwise interaction effects among predictors using a factorized parametrization of the interaction effects $\langle p_i, p_j \rangle$. $d$ denotes the number of predictors and $k$ is a hyperparameter that defines the factorization rank for the interaction terms. The rank of the factorization is typically much lower than the number of predictor variables ($k \ll d$), enabling computation of the interaction terms even in the presence of many predictors. b) The benefits of integrating comprehensive interaction terms via survivalFM are evaluated by comparing the prediction performance to the standard linear Cox regression. Reprinted and modified from Publication V. [Figure not reproduced. Panel a, comprehensive interaction modelling by machine learning (survivalFM): linear effects; all potential interaction effects; factorized parametrization with a low-rank parameter matrix of the factor vectors. Panel b, method evaluation: standard Cox regression versus survivalFM; disease prediction examples in the UK Biobank: i) case studies with various data modalities, ii) clinical example, QRISK3.]
5.3 Improved prediction of disease risk across various settings
We demonstrated that survivalFM is capable of identifying predictive interaction terms and improving risk prediction accuracy (Figure 5.2). In terms of discrimination, measured by the concordance index, survivalFM showed statistically significant improvements in 26 out of the 40 evaluated scenarios (65%), with an average increase in concordance index (ΔC-index) of 0.005. In terms of continuous net reclassification improvement (NRI), survivalFM significantly improved risk reclassification in 39 out of 40 scenarios (97.5%), yielding an average continuous NRI of 37%. Hence, despite the relatively modest gains in C-indices, the substantial improvement in continuous NRI suggests notable improvements in individual risk predictions.
A major advantage of survivalFM is its ability to introduce non-linearity by incorporating comprehensive interaction terms, while still maintaining interpretability. It achieves this by providing estimates for both main effects and interactions. Our analysis of these interaction effects revealed that numerous small interactions often jointly improved prediction accuracy, underscoring the value of capturing the complete interaction structure. Moreover, we demonstrated that capturing these interaction effects generally requires large sample sizes, with survivalFM showing an increasing performance advantage over standard Cox regression as the sample size grows.
Figure 5.2. Comparison of the predictive performance of survivalFM against standard linear Cox proportional hazards regression, shown in terms of differences in concordance index (ΔC-index) and continuous net reclassification improvement (NRI). Results are presented for ten disease outcomes (y-axis), considering four different sources of data: a) standard risk factors (blue; included in all models), b) clinical biochemistry and blood counts (red), c) metabolomic biomarkers (orange) and d) polygenic risk scores (green). Horizontal error bars represent 95% confidence intervals (CIs), estimated using bootstrapping with 1000 resamples. Reprinted from Publication V.
5.4 Enhanced cardiovascular risk prediction performance
We further demonstrated that survivalFM can enhance prediction accuracy in a practical scenario for cardiovascular disease risk prediction. Using predictors from the widely implemented QRISK3 model 20 , recommended by the UK clinical guidelines 88 , we compared three models of increasing complexity: (1) a standard Cox regression model with linear terms, (2) a Cox regression model including linear terms and the age interaction terms currently included in QRISK3, and (3) a survivalFM model that includes linear terms along with comprehensive interaction effects. In terms of discrimination, survivalFM showed a statistically significant improvement with a ΔC-index of 0.0018 (95% CI 0.0013–0.0023) compared to the standard linear model and a ΔC-index of 0.0014 (95% CI 0.0010–0.0019) compared to the model also incorporating the QRISK3 age interactions (Figure 5.3a).
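The concordance index used for the discrimination comparisons above can be sketched as follows. The thesis does not state which estimator was used; this sketch assumes Harrell's C-index for right-censored data, written as a simple quadratic-time loop, and the toy data and model names are hypothetical.

```python
import numpy as np

def harrell_c_index(time, event, risk):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable if individual i had an observed event and
    individual j was still under observation after that time. The pair is
    concordant if i was assigned the higher risk; risk ties count as 0.5.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in np.flatnonzero(event):
        later = time > time[i]
        comparable += int(later.sum())
        concordant += float(np.sum(risk[i] > risk[later]))
        concordant += 0.5 * float(np.sum(risk[i] == risk[later]))
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data: follow-up times, event indicators (0 = censored), and two risk scores.
time = [2.0, 4.0, 6.0, 8.0]
event = [1, 1, 0, 1]
risk_model_a = [0.9, 0.7, 0.8, 0.1]
risk_model_b = [4.0, 3.0, 2.0, 1.0]

c_a = harrell_c_index(time, event, risk_model_a)
c_b = harrell_c_index(time, event, risk_model_b)
delta_c = c_b - c_a   # analogous to the ΔC-index between two models
```

In this toy example there are five comparable pairs; the first score orders four of them correctly and the second orders all five, so the difference in C-index is 0.2.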
In terms of categorical reclassification at a clinically established 10% risk threshold 88 , adding the QRISK3 age interaction terms provided an overall net reclassification improvement (NRI) of 0.66% (95% CI: 0.40%–0.93%) over the linear model, while survivalFM achieved a greater NRI of 1.47% (95% CI: 1.12%–1.82%) (Figure 5.3b-c). Therefore, these results show that survivalFM more than doubles the performance gains compared to the improvements seen with the currently included QRISK3 age-interaction terms alone.
5.5 Conclusions
In conclusion, Publication V introduced survivalFM, a method that extends Cox regression by adding an estimation of all pairwise interaction effects on time-to-event outcomes. We showed that incorporating these comprehensive interaction effects improves risk prediction performance across various common diseases and sources of data. Notably, these improvements were achieved by optimizing the use of existing predictors, without the need for additional data types. In contrast to many advanced machine learning techniques in survival analysis, a notable advantage of survivalFM is that it retains interpretability by providing estimated effects for both individual predictors and their interactions. In Publication V, we further discuss some of the identified interactions and diseases where survivalFM showed the greatest benefits. Given its generalizability to other contexts, survivalFM is expected to find use cases in precision medicine and improve risk prediction model development.
Figure 5.3. Evaluation of the performance of survivalFM in a CVD risk prediction scenario involving predictors from QRISK3 (N = 344 292 with complete data, 21 534 events).
a) Discrimination performance assessed using concordance index (C-index) across three models: (1) standard Cox regression with linear terms, (2) standard Cox regression with linear terms and age interaction terms from QRISK3, and (3) survivalFM model incorporating linear terms along with all factorized pairwise interactions. b) Categorical net reclassification improvement (NRI) at a 10% absolute risk threshold, computed relative to standard Cox model with linear terms. Horizontal error bars indicate 95% confidence intervals (CIs), estimated via bootstrapping with 1000 resamples. c) Reclassification plots illustrating how including interaction terms in the more advanced models modifies individual risk predictions. Reprinted from Publication V.
6. Concluding remarks
In this dissertation, we have developed computational methods and uncovered novel biological insights that contribute to various aspects of precision medicine, including predicting the effects of drug combination treatments, utilizing metabolomic biomarkers in disease risk assessment, and enhancing methodological approaches for disease risk prediction. Given the vast number of possible drug combinations, computational methods are essential for guiding experimental research by prioritizing the most promising combinations for further validation 58,59,178 . In Publication I, we introduced comboFM, a novel machine learning framework designed to predict dose-specific responses to drug combinations. Leveraging data from a large drug combination cancer cell line screen, we showed that comboFM consistently demonstrates accurate prediction performance across multiple tissue types and drug classes in diverse prediction scenarios. Importantly, it was able to generalize predictions to new drug combinations not observed during training, offering valuable insights for repositioning existing drugs into new combinations in cancer treatment.
Experimental validation of previously untested combinations further confirmed the robustness of these predictions, highlighting the potential of comboFM to advance the development of effective combination therapies in precision oncology. Moreover, many drugs predicted by comboFM were found to be currently evaluated in clinical trials against the specific cancer types, either as single agents or in combinations with other drugs, highlighting its translational potential.
Looking forward, several research avenues can expand upon the capabilities of comboFM. While the current work validated it using cancer cell lines, the framework could be adapted to patient-derived samples as such data become more widely available 232–234 . Given that the input data required for comboFM are becoming routinely accessible in functional precision medicine studies, the framework holds potential for broad applicability across various cancer types and therapeutic contexts. However, as with many molecular profiling technologies, challenges remain in ensuring that the predictions are reliable across different experimental assay conditions and biological contexts. Therefore, future work should validate these models using diverse and well-standardized cell line and patient-derived datasets to support the wider application of comboFM. Additionally, comboFM has already inspired further methodological advancements, such as comboLTR 235 , which extends the current framework by removing the assumption of polynomial symmetry through the use of a latent tensor reconstruction (LTR) technique.
Beyond treatment strategies, this dissertation also explored risk prediction aspects of precision medicine. Comprehensive biomarker profiling can provide means for simultaneous risk stratification across multiple diseases. One promising approach involves the use of NMR metabolomics to profile small molecules and lipids in blood samples.
Publications II–IV leveraged uniquely large, population-scale NMR metabolomics datasets to identify novel biomarkers and establish associations of metabolomic biomarkers across a wide range of health outcomes, including conditions where metabolomics had not previously been studied at scale. For instance, Publication II showed that metabolomic biomarkers primarily linked to cardiometabolic diseases can also predict susceptibility to severe infectious disease outcomes such as hospitalization or death from COVID-19. Publication III extended these findings by demonstrating widespread associations of many of these biomarkers across a broad spectrum of common diseases. In Publication IV, metabolomic risk scores were derived and tested for 12 major chronic diseases, demonstrating consistent predictive performance across three large biobanks. Collectively, these studies illustrate the potential of comprehensive metabolomic biomarker profiling as a tool for disease risk prediction. Future research can build upon these findings to further explore the role of metabolomics in risk prediction and other applications in precision medicine. Subsequent studies using the UK Biobank metabolomics dataset have already resulted in numerous publications, covering topics such as causal analyses 128,130,236 and other risk prediction studies 151–153 . While large biobanks are effective for investigating multiple diseases simultaneously, complementary studies in disease-specific clinical cohorts will help bring greater resolution to these findings. For instance, metabolomic analyses in clinical cohorts could be used to develop and validate models to predict the risk of disease progression or complications, extending beyond the primary prevention focus on first incidence studied here. Additionally, the UK Biobank participants are not fully representative of the broader population; as middle-aged volunteers, they tend to be healthier than average 237,238 . 
Therefore, studies in more diverse and underrepresented populations will be essential for further validating these findings. Future work should also evaluate how modern risk prediction tools, such as those exemplified here, can be integrated into existing clinical workflows to improve patient stratification and decision-making.
Continuing with disease risk prediction, we also addressed its methodological aspects. Capturing complex nonlinear relationships, such as interactions among predictor variables, can improve the accuracy of risk prediction models. In Publication V, we introduced survivalFM, a novel machine learning method for modelling time-to-event outcomes, such as disease risks. This method extends the widely used Cox proportional hazards regression by incorporating comprehensive interaction effects among predictor variables using a factorized parametrization approach, similar to the one applied in Publication I for predicting drug combination effects. We showed that accounting for the comprehensive interactions improves the accuracy of risk prediction models across various disease outcomes and data sources. A notable advantage of survivalFM, compared to many advanced machine learning techniques in survival analysis, is its ability to introduce non-linearity through comprehensive interaction terms while maintaining model interpretability. It provides estimated effects for both individual predictors and their interactions, making it particularly valuable in settings where understanding the contributions of predictors is essential for translational applications.
Looking ahead, several avenues exist for expanding the applications of survivalFM. Given its generalizability, we anticipate survivalFM to find use cases in precision medicine and enhance time-to-event modeling in large-scale studies involving many predictors.
While the method demonstrated improved prediction performance in the UK Biobank, we also showed that capturing predictive interaction effects generally requires large sample sizes. This may constrain its use in smaller cohorts. However, emerging biobank initiatives with extensive clinical and omics data provide opportunities for further validation. The generalizable nature of survivalFM allows its application to data sources beyond those addressed in this work, such as incorporating multi-omics data to leverage predictive interactions across multiple molecular layers. For instance, proteomics has recently shown promise in risk prediction 109–111 , and given a sufficiently large sample size, survivalFM could be applied to uncover predictive protein-protein interactions. Additionally, from a methodological perspective, there is potential to extend survivalFM to capture higher-order interactions involving more than two predictors. This could further improve prediction performance by capturing more complex relationships.
In conclusion, this dissertation has made significant contributions to precision medicine through the development of predictive modelling approaches and their application to diverse biomedical datasets. By identifying promising drug combinations for cancer treatment, discovering novel metabolomic biomarkers, and developing methods and models for disease risk prediction, we have demonstrated how computational approaches can transform molecular data into actionable insights. Individually and collectively, these findings advance the translation of molecular data into prevention and treatment strategies in precision medicine.
References
[1] Francis S Collins and Harold Varmus. A new initiative on precision medicine. New England Journal of Medicine, 372(9):793–795, 2015.
[2] The White House, Office of the Press Secretary. President Obama’s Precision Medicine Initiative.
https://obamawhitehouse.archives.gov/the-press-office/2015/01/30/fact-sheet-president-obama-s-precision-medicine-initiative, 2015. Accessed: 2024-09-11.
[3] Clare Turnbull, Richard H Scott, Ellen Thomas, Louise Jones, Nirupa Murugaesu, Freya Boardman Pretty, Dina Halai, Emma Baple, Clare Craig, Angela Hamblin, et al. The 100 000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ, 361, 2018.
[4] Our Future Health Protocol. Protocol version 5.0. https://ourfuturehealth.org.uk/our-research-mission/, 2024. Accessed August 12th 2024.
[5] Yves Lévy. Genomic medicine 2025: France in the race for precision medicine. The Lancet, 388(10062):2872, 2016.
[6] Eleanor Wong, Nicolas Bertin, Maxime Hebrard, Roberto Tirado-Magallanes, Claire Bellis, Weng Khong Lim, Chee Yong Chua, Philomena Mei Lin Tong, Raymond Chua, Kenneth Mak, et al. The singapore national precision medicine strategy. Nature Genetics, 55(2):178–186, 2023.
[7] Shirley Musich, Shaohung Wang, Kevin Hawkins, and Andrea Klemes. The impact of personalized preventive care on health care quality, utilization, and expenditures. Population health management, 19(6):389–397, 2016.
[8] Euan A Ashley. Towards precision medicine. Nature Reviews Genetics, 17(9):507–522, 2016.
[9] Holger Fröhlich, Rudi Balling, Niko Beerenwinkel, Oliver Kohlbacher, Santosh Kumar, Thomas Lengauer, Marloes H Maathuis, Yves Moreau, Susan A Murphy, Teresa M Przytycka, et al. From hype to reality: data science enabling personalized medicine. BMC Medicine, 16:1–15, 2018.
[10] Karl Landsteiner. Agglutination phenomena in normal human blood. Wien Klin Wochenschr, 14:1132–4, 1901.
[11] Mohan Babu and Michael Snyder. Multi-omics profiling for health. Molecular & Cellular Proteomics, 22(6), 2023.
[12] Yehudit Hasin, Marcus Seldin, and Aldons Lusis. Multi-omics approaches to disease. Genome Biology, 18:1–15, 2017.
[13] Sanjiv Sam Gambhir, T Jessie Ge, Ophir Vermesh, and Ryan Spitler. Toward achieving precision health.
Science Translational Medicine, 10(430):eaao3612, 2018.
[14] Joshua C Denny and Francis S Collins. Precision medicine in 2030—seven ways to transform healthcare. Cell, 184(6):1415–1419, 2021.
[15] Kevin B Johnson, Wei-Qi Wei, Dilhan Weeraratne, Mark E Frisse, Karl Misulis, Kyu Rhee, Juan Zhao, and Jane L Snowdon. Precision medicine, AI, and the future of personalized health care. Clinical and translational science, 14(1):86–93, 2021.
[16] Andrew L Beam and Isaac S Kohane. Big data and machine learning in health care. JAMA, 319(13):1317–1318, 2018.
[17] Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. Machine learning in medicine. New England Journal of Medicine, 380(14):1347–1358, 2019.
[18] Bruna Gomes and Euan A Ashley. Artificial intelligence in molecular medicine. New England Journal of Medicine, 388(26):2456–2465, 2023.
[19] Sarah J MacEachern and Nils D Forkert. Machine learning for precision medicine. Genome, 64(4):416–425, 2021.
[20] Julia Hippisley-Cox, Carol Coupland, and Peter Brindle. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ, 357, 2017.
[21] Stephen Kaptoge, Lisa Pennells, Dirk De Bacquer, Marie Therese Cooney, Maryam Kavousi, Gretchen Stevens, Leanne Margaret Riley, Stefan Savin, Taskeen Khan, Servet Altay, et al. World Health Organization cardiovascular disease risk charts: revised models to estimate risk in 21 global regions. The Lancet Global Health, 7(10):e1332–e1345, 2019.
[22] SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. European Heart Journal, 42(25):2439–2454, 2021.
[23] Amit V Khera, Mark Chaffin, Krishna G Aragam, Mary E Haas, Carolina Roselli, Seung Hoan Choi, Pradeep Natarajan, Eric S Lander, Steven A Lubitz, Patrick T Ellinor, et al.
Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics, 50(9):1219–1224, 2018.
[24] Ali Torkamani, Nathan E Wineinger, and Eric J Topol. The personal and clinical utility of polygenic risk scores. Nature Reviews Genetics, 19(9):581–590, 2018.
[25] George Nicholson, Mattias Rantalainen, Anthony D Maher, Jia V Li, Daniel Malmodin, Kourosh R Ahmadi, Johan H Faber, Ingileif B Hallgrímsdóttir, Amy Barrett, Henrik Toft, et al. Human metabolic profiles are stably controlled by genetic and environmental variation. Molecular systems biology, 7(1):525, 2011.
[26] Robert W McGarrah, Scott B Crown, Guo-Fang Zhang, Svati H Shah, and Christopher B Newgard. Cardiovascular metabolomics. Circulation research, 122(9):1238–1258, 2018.
[27] Zsu-Zsu Chen and Robert E Gerszten. Metabolomics and proteomics in type 2 diabetes. Circulation research, 126(11):1613–1627, 2020.
[28] Peter Würtz, Antti J Kangas, Pasi Soininen, Debbie A Lawlor, George Davey Smith, and Mika Ala-Korpela. Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies. American Journal of Epidemiology, 186(9):1084–1096, 2017.
[29] Maik Pietzner, Isobel D Stewart, Johannes Raffler, Kay-Tee Khaw, Gregory A Michelotti, Gabi Kastenmüller, Nicholas J Wareham, and Claudia Langenberg. Plasma metabolites to profile pathways in noncommunicable disease multimorbidity. Nature Medicine, 27(3):471–479, 2021.
[30] Joris Deelen, Johannes Kettunen, Krista Fischer, Ashley van der Spek, Stella Trompet, Gabi Kastenmüller, Andy Boyd, Jonas Zierer, Erik B van den Akker, Mika Ala-Korpela, et al. A metabolic profile of all-cause mortality risk identified in an observational study of 44,168 individuals. Nature Communications, 10(1):3346, 2019.
[31] Pasi Soininen, Antti J Kangas, Peter Würtz, Teemu Suna, and Mika Ala-Korpela.
Quantitative serum nuclear magnetic resonance metabolomics in cardiovascular epidemiology and genetics. Circulation: Cardiovascular Genetics, 8(1):192–206, 2015.
[32] David R Cox. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2):187–202, 1972.
[33] Hemant Ishwaran, Udaya B Kogalur, Eugene H Blackstone, and Michael S Lauer. Random survival forests. The Annals of Applied Statistics, 2(3):841–860, 2008.
[34] Jared L Katzman, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. DeepSurv: personalized treatment recommender system using a cox proportional hazards deep neural network. BMC Medical Research Methodology, 18:1–12, 2018.
[35] Chirag Nagpal, Xinyu Li, and Artur Dubrawski. Deep survival machines: Fully parametric survival regression and representation learning for censored data with competing risks. IEEE Journal of Biomedical and Health Informatics, 25(8):3163–3175, 2021.
[36] David J Hunter and Christopher Holmes. Where medical statistics meets artificial intelligence. New England Journal of Medicine, 389(13):1211–1219, 2023.
[37] Rebecca Giddings, Anabel Joseph, Thomas Callender, Sam M Janes, Mihaela Van der Schaar, Jessica Sheringham, and Neal Navani. Factors influencing clinician and patient interaction with machine learning-based risk prediction models: a systematic review. The Lancet Digital Health, 6(2):e131–e144, 2024.
[38] Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019.
[39] Gregor Stiglic, Primoz Kocbek, Nino Fijacko, Marinka Zitnik, Katrien Verbert, and Leona Cilar. Interpretability of machine learning-based prediction models in healthcare. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(5):e1379, 2020.
[40] Chris Finan, Anna Gaulton, Felix A Kruger, R Thomas Lumbers, Tina Shah, Jorgen Engmann, Luana Galver, Ryan Kelley, Anneli Karlsson, Rita Santos, et al. The druggable genome and support for target identification and validation in drug development. Science translational medicine, 9(383):eaag1166, 2017.
[41] Jussi Paananen and Vittorio Fortino. An omics perspective on drug target discovery platforms. Briefings in bioinformatics, 21(6):1937–1953, 2020.
[42] Philippe L Bedard, David M Hyman, Matthew S Davids, and Lillian L Siu. Small molecules, big impact: 20 years of targeted therapy in oncology. The Lancet, 395(10229):1078–1088, 2020.
[43] Bert Vogelstein, Nickolas Papadopoulos, Victor E Velculescu, Shibin Zhou, Luis A Diaz Jr, and Kenneth W Kinzler. Cancer genome landscapes. Science, 339(6127):1546–1558, 2013.
[44] David M Hyman, Barry S Taylor, and José Baselga. Implementing genome-driven oncology. Cell, 168(4):584–599, 2017.
[45] Charles Sawyers. Targeted cancer therapy. Nature, 432(7015):294–297, 2004.
[46] Caitriona Holohan, Sandra Van Schaeybroeck, Daniel B Longley, and Patrick G Johnston. Cancer drug resistance: an evolving paradigm. Nature Reviews Cancer, 13(10):714–726, 2013.
[47] Haojie Jin, Liqin Wang, and René Bernards. Rational combinations of targeted cancer therapies: background, advances and challenges. Nature Reviews Drug Discovery, 22(3):213–234, 2023.
[48] Deborah Plana, Adam C Palmer, and Peter K Sorger. Independent drug action in combination therapy: implications for precision oncology. Cancer discovery, 12(3):606–624, 2022.
[49] Bissan Al-Lazikani, Udai Banerji, and Paul Workman. Combinatorial drug therapy for cancer in the post-genomic era. Nature Biotechnology, 30(7):679–692, 2012.
[50] Pradipta Das, Michael D Delost, Munaum H Qureshi, David T Smith, and Jon T Njardarson. A survey of the structures of US FDA approved combination drugs. Journal of medicinal chemistry, 62(9):4265–4311, 2018.
[51] Jia Jia, Feng Zhu, Xiaohua Ma, Zhiwei W Cao, Yixue X Li, and Yu Zong Chen. Mechanisms of drug combinations: interaction and network perspectives. Nature Reviews Drug Discovery, 8(2):111–128, 2009. [52] Salvador Fudio, Alvaro Sellers, Laura Pérez Ramos, Beatriz Gil-Alberdi, Ali Zeaiter, Mikel Urroz, Antonio Carcas, and Rubin Lubomirov. Anticancer drug combinations approved by US FDA from 2011 to 2021: main design features of clinical trials and role of pharmacokinetics. Cancer Chemotherapy and Pharmacology, 90(4):285–299, 2022. [53] Susan L Holbeck, Richard Camalier, James A Crowell, Jeevan Prasaad Govindharajulu, Melinda Hollingshead, Lawrence W Anderson, Eric Polley, Larry Rubinstein, Apurva Srivastava, Deborah Wilsker, et al. The National Cancer Institute ALMANAC: a comprehensive screening resource for the detection of anticancer drug pairs with enhanced therapeutic activity. Cancer Research, 77(13):3564–3576, 2017. [54] Patricia Jaaks, Elizabeth A Coker, Daniel J Vis, Olivia Edwards, Emma F Carpenter, Simonetta M Leto, Lisa Dwane, Francesco Sassi, Howard Lightfoot, Syd Barthorpe, et al. Effective drug combinations in breast, colon and pancreatic cancer cells. Nature, 603(7899):166–173, 2022. [55] Nishanth Ulhas Nair, Patricia Greninger, Xiaohu Zhang, Adam A Friedman, Arnaud Amzallag, Eliane Cortez, Avinash Das Sahu, Joo Sang Lee, Anahita Dastur, Regina K Egan, et al. A landscape of response to drug combinations in non-small cell lung cancer. Nature Communications, 14(1):3830, 2023. 90 References [56] Zohar B Weinstein, Andreas Bender, and Murat Cokol. Prediction of synergistic drug combinations. Current Opinion in Systems Biology, 4:24–28, 2017. [57] Lianlian Wu, Yuqi Wen, Dongjin Leng, Qinglong Zhang, Chong Dai, Zhongming Wang, Ziqi Liu, Bowei Yan, Yixin Zhang, Jing Wang, et al. Machine learning methods, databases and tools for drug combination prediction. Briefings in Bioinformatics, 23(1):bbab355, 2022. 
[58] Weikaixin Kong, Gianmarco Midena, Yingjia Chen, Paschalis Athanasiadis, Tianduanyi Wang, Juho Rousu, Liye He, and Tero Aittokallio. Systematic review of computational methods for drug combination prediction. Computational and structural biotechnology journal, 20:2807–2814, 2022. [59] Anna Torkamannia, Yadollah Omidi, and Reza Ferdousi. A review of machine learning approaches for drug synergy prediction in cancer. Briefings in Bioinformatics, 23(3):bbac075, 2022. [60] Fei Zhou, Ting Yu, Ronghui Du, Guohui Fan, Ying Liu, Zhibo Liu, Jie Xiang, Yeming Wang, Bin Song, Xiaoying Gu, et al. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet, 395(10229):1054–1062, 2020. [61] Matthew J Cummings, Matthew R Baldwin, Darryl Abrams, Samuel D Jacobson, Benjamin J Meyer, Elizabeth M Balough, Justin G Aaron, Jan Claassen, LeRoy E Rabbani, Jonathan Hastie, et al. Epidemiology, clinical course, and outcomes of critically ill adults with COVID-19 in New York City: a prospective cohort study. The Lancet, 395(10239):1763–1770, 2020. [62] Mathieu Blondel, Akinori Fujino, Naonori Ueda, and Masakazu Ishihata. Higher-order factorization machines. Advances in Neural Information Processing Systems, 29, 2016. [63] Teri A Manolio. Genomewide association studies and assessment of the risk of disease. New England Journal of Medicine, 363(2):166–176, 2010. [64] Peter M Visscher, Naomi R Wray, Qian Zhang, Pamela Sklar, Mark I McCarthy, Matthew A Brown, and Jian Yang. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics, 101(1):5–22, 2017. [65] Matthew R Nelson, Hannah Tipney, Jeffery L Painter, Judong Shen, Paola Nicoletti, Yufeng Shen, Aris Floratos, Pak Chung Sham, Mulin Jun Li, Junwen Wang, et al. The support of human genetic evidence for approved drug indications. Nature Genetics, 47(8):856–860, 2015. [66] Emily A King, J Wade Davis, and Jacob F Degner. 
Are drug targets with genetic support twice as likely to be approved? revised estimates of the impact of genetic support for drug mechanisms on the probability of drug approval. PLoS genetics, 15(12):e1008489, 2019. [67] Emil Uffelmann, Qin Qin Huang, Nchangwi Syntia Munung, Jantina De Vries, Yukinori Okada, Alicia R Martin, Hilary C Martin, Tuuli Lappalainen, and Danielle Posthuma. Genome-wide association studies. Nature Reviews Methods Primers, 1(1):59, 2021. [68] Eleftheria Zeggini, Anna L Gloyn, Anne C Barton, and Louise V Wain. Translational genomics and precision medicine: Moving from the lab to the clinic. Science, 365(6460):1409–1413, 2019. [69] Cathryn M Lewis and Evangelos Vassos. Polygenic risk scores: from research tools to clinical instruments. Genome Medicine, 12(1):44, 2020. 91 References [70] Tjeerd Van Der Ploeg, Peter C Austin, and Ewout W Steyerberg. Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints. BMC Medical Research Methodology, 14:1–13, 2014. [71] Evangelia Christodoulou, Jie Ma, Gary S Collins, Ewout W Steyerberg, Jan Y Verbakel, and Ben Van Calster. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110:12–22, 2019. [72] Frank E Harrell, Robert M Califf, David B Pryor, Kerry L Lee, and Robert A Rosati. Evaluating the yield of medical tests. JAMA, 247(18):2543–2546, 1982. [73] Michael J Pencina, Ralph B D’Agostino Sr, Ralph B D’Agostino Jr, and Ramachandran S Vasan. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in medicine, 27(2):157–172, 2008. [74] Gary S Collins, Johannes B Reitsma, Douglas G Altman, and Karel GM Moons. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (tripod) the TRIPOD statement. Circulation, 131(2):211–219, 2015. 
[75] Thomas R Dawber, William B Kannel, Nicholas Revotskie, Joseph Stokes III, Abraham Kagan, and Tavia Gordon. Some factors associated with the development of coronary heart disease—six years’ follow-up experience in the Framingham study. American Journal of Public Health and the Nations Health, 49(10):1349–1356, 1959. [76] William B Kannel, Thomas R Dawber, Abraham Kagan, Nicholas Revotskie, and JOSEPH STOKES III. Factors of risk in the development of coronary heart disease—six-year follow-up experience: the Framingham study. Annals of internal medicine, 55(1):33–50, 1961. [77] Jeanne Truett, Jerome Cornfield, and William Kannel. A multivariate analysis of the risk of coronary heart disease in Framingham. Journal of chronic diseases, 20(7):511–524, 1967. [78] Ancel Keys. Coronary heart disease in seven countries. Circulation, 41:I– 211, 1970. [79] Charlene F Belanger, Charles H Hennekens, Bernard Rosner, Frank E Speizer, et al. The nurses’ health study. Am J Nurs, 78(6):1039–1040, 1978. [80] Peter WF Wilson, Ralph B D’Agostino, Daniel Levy, Albert M Belanger, Halit Silbershatz, and William B Kannel. Prediction of coronary heart disease using risk factor categories. Circulation, 97(18):1837–1847, 1998. [81] Syed S Mahmood, Daniel Levy, Ramachandran S Vasan, and Thomas J Wang. The Framingham heart study and the epidemiology of cardiovascular disease: a historical perspective. The Lancet, 383(9921):999–1008, 2014. [82] Mitchell H Gail, Louise A Brinton, David P Byar, Donald K Corle, Sylvan B Green, Catherine Schairer, and John J Mulvihill. Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. JNCI: Journal of the National Cancer Institute, 81(24):1879–1886, 1989. [83] Julia Hippisley-Cox and Carol Coupland. Development and validation of QDiabetes-2018 risk prediction algorithm to estimate future risk of type 2 diabetes: cohort study. BMJ, 359, 2017. 
92 References [84] Julia Hippisley-Cox and Carol Coupland. Development and validation of risk prediction algorithms to estimate future risk of common cancers in men and women: prospective cohort study. BMJ Open, 5(3):e007825, 2015. [85] Erkki Vartiainen, Tiina Laatikainen, Pekka Jousilahti, Markku Peltonen, Teemu Niiranen, and Veikko Salomaa. Sepelvaltimotaudin ja aivohalvauksen riskin arviointi FINRISKI 2.0-laskurilla. 2020. [86] David C Goff Jr, Donald M Lloyd-Jones, Glen Bennett, Sean Coady, Ralph B D’agostino, Raymond Gibbons, Philip Greenland, Daniel T Lackland, Daniel Levy, Christopher J O’donnell, et al. 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation, 129(25_suppl_2):S49–S73, 2014. [87] Xueli Yang, Jianxin Li, Dongsheng Hu, Jichun Chen, Ying Li, Jianfeng Huang, Xiaoqing Liu, Fangchao Liu, Jie Cao, Chong Shen, et al. Predicting the 10-year risks of atherosclerotic cardiovascular disease in Chinese population: the China-PAR project (prediction for ASCVD risk in china). Circulation, 134(19):1430–1440, 2016. Car[88] National Institute for Health and Care Excellence. diovascular disease: risk assessment and reduction, including lipid modification (NICE guideline [NG238]), 2023. https://www.nice.org.uk/guidance/ng238/chapter/Recommendationsstatinsfor-primary-prevention-of-cardiovascular-disease. Date accessed: 2024-0430. [89] Frank LJ Visseren, François Mach, Yvo M Smulders, David Carballo, Konstantinos C Koskinas, Maria Bäck, Athanase Benetos, Alessandro Biffi, José-Manuel Boavida, Davide Capodanno, et al. 
2021 ESC Guidelines on cardiovascular disease prevention in clinical practice: Developed by the Task Force for cardiovascular disease prevention in clinical practice with representatives of the European Society of Cardiology and 12 medical societies With the special contribution of the European Association of Preventive Cardiology (EAPC). European heart journal, 42(34):3227–3337, 2021. [90] GBD 2015 Risk Factors Collaborators et al. Global, regional, and national comparative risk assessment of 79 behavioural, environmental and occupational, and metabolic risks or clusters of risks, 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet (London, England), 388(10053):1659, 2016. [91] World Health Organization. Global spending on health 2020: weathering the storm. 2020. [92] John Robson, Isabel Dostal, Aziz Sheikh, Sandra Eldridge, Vichithranie Madurasinghe, Chris Griffiths, Carol Coupland, and Julia Hippisley-Cox. The NHS Health Check in England: an evaluation of the first 4 years. BMJ Open, 6(1):e008840, 2016. [93] NHS Health Check. [https://www.nhs.uk/conditions/nhs-health-check/]. Accessed September 16th 2024. , 2024. [94] Clare Bycroft, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T Elliott, Kevin Sharp, Allan Motyer, Damjan Vukcevic, Olivier Delaneau, Jared O’Connell, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203–209, 2018. 93 References [95] Mitja I Kurki, Juha Karjalainen, Priit Palta, Timo P Sipilä, Kati Kristiansson, Kati M Donner, Mary P Reeve, Hannele Laivuori, Mervi Aavikko, Mari A Kaunisto, et al. Finngen provides genetic insights from a wellphenotyped isolated population. Nature, 613(7944):508–518, 2023. [96] Mayo Blegen Ashley L. 18 Wirkus Samantha J. 18 Wagner Victoria A. 18 Meyer Jeffrey G. 18 Cicek Mine S. 10 18 Biobank and All of Us Research Demonstration Project Teams Choi Seung Hoan 14 http://orcid. org/00000002-0322-8970 Wang Xin 14 http://orcid. 
org/0000 0001-6042-4487 Rosenthal Elisabeth A. 15. Genomic data in the All of Us research program. Nature, 627(8003):340–346, 2024. [97] Iain S Forrest, Ben O Petrazzini, áine Duffy, Joshua K Park, Carla MarquezLuna, Daniel M Jordan, Ghislain Rocheleau, Judy H Cho, Robert S Rosenson, Jagat Narula, et al. Machine learning-based marker for coronary artery disease: derivation and validation in two longitudinal cohorts. The Lancet, 401(10372):215–225, 2023. [98] Ben O Petrazzini, Kumardeep Chaudhary, Carla Márquez-Luna, Iain S Forrest, Ghislain Rocheleau, Judy Cho, Jagat Narula, Girish Nadkarni, and Ron Do. Coronary risk estimation based on clinical data in electronic health records. Journal of the American College of Cardiology, 79(12):1155–1166, 2022. [99] Ruowang Li, Yong Chen, Marylyn D Ritchie, and Jason H Moore. Electronic health records and polygenic risk scores for predicting disease risk. Nature Reviews Genetics, 21(8):493–502, 2020. [100] Michael Inouye, Gad Abraham, Christopher P Nelson, Angela M Wood, Michael J Sweeting, Frank Dudbridge, Florence Y Lai, Stephen Kaptoge, Marta Brozynska, Tingting Wang, et al. Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology, 72(16):1883–1893, 2018. [101] Gad Abraham, Rainer Malik, Ekaterina Yonova-Doing, Agus Salim, Tingting Wang, John Danesh, Adam S Butterworth, Joanna MM Howson, Michael Inouye, and Martin Dichgans. Genomic risk score offers predictive performance comparable to clinical risk factors for ischaemic stroke. Nature Communications, 10(1):5819, 2019. [102] Atlas Khan, Michael C Turchin, Amit Patki, Vinodh Srinivasasainagendra, Ning Shang, Rajiv Nadukuru, Alana C Jones, Edyta Malolepsza, Ozan Dikilitas, Iftikhar J Kullo, et al. Genome-wide polygenic score to predict chronic kidney disease across ancestries. Nature Medicine, 28(7):1412–1420, 2022. 
[103] Tian Ge, Marguerite R Irvin, Amit Patki, Vinodh Srinivasasainagendra, Yen-Feng Lin, Hemant K Tiwari, Nicole D Armstrong, Barbara Benoit, ChiaYen Chen, Karmel W Choi, et al. Development and validation of a transancestry polygenic risk score for type 2 diabetes in diverse populations. Genome Medicine, 14(1):70, 2022. [104] Max Tamlander, Bradley Jermy, Toni T Seppälä, Martti Färkkilä, FinnGen, Elisabeth Widén, Samuli Ripatti, and Nina Mars. Genome-wide polygenic risk scores for colorectal cancer have implications for risk-based screening. British Journal of Cancer, 130(4):651–659, 2024. [105] Xin Yang, Siddhartha Kar, Antonis C Antoniou, and Paul DP Pharoah. Polygenic scores in cancer. Nature reviews Cancer, 23(9):619–630, 2023. 94 References [106] Rayjean J Hung, Matthew T Warkentin, Yonathan Brhane, Nilanjan Chatterjee, David C Christiani, Maria Teresa Landi, Neil E Caporaso, Geoffrey Liu, Mattias Johansson, Demetrius Albanes, et al. Assessing lung cancer absolute risk trajectory based on a polygenic risk model. Cancer Research, 81(6):1607–1615, 2021. [107] Nasim Mavaddat, Kyriaki Michailidou, Joe Dennis, Michael Lush, Laura Fachal, Andrew Lee, Jonathan P Tyrer, Ting-Huei Chen, Qin Wang, Manjeet K Bolla, et al. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics, 104(1):21–34, 2019. [108] Genevieve L Wojcik, Mariaelisa Graff, Katherine K Nishimura, Ran Tao, Jeffrey Haessler, Christopher R Gignoux, Heather M Highland, Yesha M Patel, Elena P Sorokin, Christy L Avery, et al. Genetic analyses of diverse populations improves discovery for complex traits. Nature, 570(7762):514– 518, 2019. [109] Julia Carrasco-Zanini, Maik Pietzner, Jonathan Davitte, Praveen Surendran, Damien C Croteau-Chonka, Chloe Robins, Ana Torralbo, Christopher Tomlinson, Florian Grünschläger, Natalie Fitzpatrick, et al. Proteomic signatures improve risk prediction for common and rare diseases. 
Nature Medicine, pages 1–10, 2024. [110] Jia You, Yu Guo, Yi Zhang, Ju-Jiao Kang, Lin-Bo Wang, Jian-Feng Feng, Wei Cheng, and Jin-Tai Yu. Plasma proteomic profiles predict individual future health risk. Nature Communications, 14(1):7817, 2023. [111] Danni A Gadd, Robert F Hillary, Zhana Kuncheva, Tasos Mangelis, Yipeng Cheng, Manju Dissanayake, Romi Admanit, Jake Gagnon, Tinchi Lin, Kyle L Ferber, et al. Blood protein assessment of leading incident diseases and mortality in the UK Biobank. Nature Aging, pages 1–10, 2024. [112] Oliver Fiehn. Metabolomics—the link between genotypes and phenotypes. Functional genomics, pages 155–171, 2002. [113] Elaine Holmes, Ruey Leng Loo, Jeremiah Stamler, Magda Bictash, Ivan KS Yap, Queenie Chan, Tim Ebbels, Maria De Iorio, Ian J Brown, Kirill A Veselkov, et al. Human metabolic phenotype diversity and its association with diet and blood pressure. Nature, 453(7193):396–400, 2008. [114] Aifric O’Sullivan, Michael J Gibney, and Lorraine Brennan. Dietary intake patterns are reflected in metabolomic profiles: potential role in dietary assessment studies. The American journal of clinical nutrition, 93(2):314– 321, 2011. [115] Rima Kaddurah-Daouk, Bruce S Kristal, and Richard M Weinshilboum. Metabolomics: a global biochemical approach to drug response and disease. Annu. Rev. Pharmacol. Toxicol., 48(1):653–683, 2008. [116] David S Wishart. Metabolomics for investigating physiological and pathophysiological processes. Physiological reviews, 99(4):1819–1875, 2019. [117] David S Wishart, AnChi Guo, Eponine Oler, Fei Wang, Afia Anjum, Harrison Peters, Raynard Dizon, Zinat Sayeeda, Siyang Tian, Brian L Lee, et al. HMDB 5.0: the human metabolome database for 2022. Nucleic acids research, 50(D1):D622–D631, 2022. [118] Aihua Zhang, Hui Sun, Ping Wang, Ying Han, and Xijun Wang. Recent and potential developments of biofluid analyses in metabolomics. Journal of proteomics, 75(4):1079–1088, 2012. 95 References [119] Abdul-Hamid M Emwas. 
The strengths and weaknesses of NMR spectroscopy and mass spectrometry with particular focus on metabolomics research. Metabonomics: Methods and protocols, pages 161–193, 2015. [120] John L Markley, Rafael Brüschweiler, Arthur S Edison, Hamid R Eghbalnia, Robert Powers, Daniel Raftery, and David S Wishart. The future of NMRbased metabolomics. Current opinion in biotechnology, 43:34–40, 2017. [121] Jeramie D Watrous, Mir Henglin, Brian Claggett, Kim A Lehmann, Martin G Larson, Susan Cheng, and Mohit Jain. Visualization, quantification, and alignment of spectral drift in population scale untargeted metabolomics data. Analytical chemistry, 89(3):1399–1404, 2017. [122] GA Nagana Gowda and Danijel Djukovic. Overview of mass spectrometrybased metabolomics: opportunities and challenges. Mass Spectrometry in Metabolomics: Methods and Protocols, pages 3–12, 2014. [123] Amanda Rundblad, Jacob J Christensen, Kristin S Hustad, Nasser E Bastani, Inger Ottestad, Kirsten B Holven, and Stine M Ulven. Associations between dietary intake and glucose tolerance in clinical and metabolomicsbased metabotypes. Genes & nutrition, 18(1):3, 2023. [124] Yu Xu, Scott C Ritchie, Yujian Liang, Paul RHJ Timmers, Maik Pietzner, Loïc Lannelongue, Samuel A Lambert, Usman A Tahir, Sebastian MayWilson, Carles Foguet, et al. An atlas of genetic scores to predict multi-omic traits. Nature, 616(7955):123–131, 2023. [125] Minna K Karjalainen, Savita Karthikeyan, Clare Oliver-Williams, Eeva Sliz, Elias Allara, Wing Tung Fung, Praveen Surendran, Weihua Zhang, Pekka Jousilahti, Kati Kristiansson, et al. Genome-wide characterization of circulating metabolic biomarkers. Nature, 628(8006):130–138, 2024. [126] Luca A Lotta, Maik Pietzner, Isobel D Stewart, Laura BL Wittemans, Chen Li, Roberto Bonelli, Johannes Raffler, Emma K Biggs, Clare OliverWilliams, Victoria PW Auyeung, et al. Cross-platform genetic discovery of small molecule products of metabolism and application to clinical outcomes. 
Nature Genetics, 53(1):54, 2021. [127] Johannes Kettunen, Ayşe Demirkan, Peter Würtz, Harmen HM Draisma, Toomas Haller, Rajesh Rawal, Anika Vaarhorst, Antti J Kangas, Leo-Pekka Lyytikäinen, Matti Pirinen, et al. Genome-wide study for circulating metabolites identifies 62 loci and reveals novel systemic effects of LPA. Nature Communications, 7(1):11122, 2016. [128] Joshua A Bell, Tom G Richardson, Qin Wang, Eleanor Sanderson, Tom Palmer, Venexia Walker, Linda M O’Keeffe, Nicholas J Timpson, Anna Cichonska, Heli Julkunen, et al. Effects of general and central adiposity on circulating lipoprotein, lipid, and metabolite levels in UK Biobank: a multivariable Mendelian randomization study. The Lancet Regional Health– Europe, 21, 2022. [129] Jiarui Mi, Lingjuan Jiang, Zhengye Liu, Xia Wu, Nan Zhao, Yuanzhuo Wang, and Xiaoyin Bai. Identification of blood metabolites linked to the risk of cholelithiasis: a comprehensive Mendelian randomization study. Hepatology International, 16(6):1484–1493, 2022. [130] Maria Carolina Borges, Philip C Haycock, Jie Zheng, Gibran Hemani, Michael V Holmes, George Davey Smith, Aroon D Hingorani, and Deborah A Lawlor. Role of circulating polyunsaturated fatty acids on cardiovascular diseases risk: analysis using Mendelian randomization and fatty acid genetic association data from over 114,000 UK Biobank participants. BMC Medicine, 20(1):210, 2022. 96 References [131] Thomas J Wang, Martin G Larson, Ramachandran S Vasan, Susan Cheng, Eugene P Rhee, Elizabeth McCabe, Gregory D Lewis, Caroline S Fox, Paul F Jacques, Céline Fernandez, et al. Metabolite profiles and the risk of developing diabetes. Nature Medicine, 17(4):448–453, 2011. [132] Anna Floegel, Norbert Stefan, Zhonghao Yu, Kristin Mühlenbruch, Dagmar Drogan, Hans-Georg Joost, Andreas Fritsche, Hans-Ulrich Häring, Martin Hrabě de Angelis, Annette Peters, et al. Identification of serum metabolites associated with risk of type 2 diabetes using a targeted metabolomic approach. 
Diabetes, 62(2):639–648, 2013. [133] Peter Würtz, Pasi Soininen, Antti J Kangas, Tapani Rönnemaa, Terho Lehtimäki, Mika Kähönen, Jorma S Viikari, Olli T Raitakari, and Mika Ala-Korpela. Branched-chain and aromatic amino acids are predictors of insulin resistance in young adults. Diabetes Care, 36(3):648–655, 2013. [134] Svati H Shah, William E Kraus, and Christopher B Newgard. Metabolomic profiling for the identification of novel biomarkers and mechanisms related to common cardiovascular diseases: form and function. Circulation, 126(9):1110–1120, 2012. [135] Christin Stegemann, Raimund Pechlaner, Peter Willeit, Sarah R Langley, Massimo Mangino, Ursula Mayr, Cristina Menni, Alireza Moayyeri, Peter Santer, Gregor Rungger, et al. Lipidomics profiling and risk of cardiovascular disease in the prospective population-based Bruneck study. Circulation, 129(18):1821–1831, 2014. [136] Peter Würtz, Juho R Raiko, Costan G Magnussen, Pasi Soininen, Antti J Kangas, Tuulia Tynkkynen, Russell Thomson, Reino Laatikainen, Markku J Savolainen, Jari Laurikka, et al. High-throughput quantification of circulating metabolites improves prediction of subclinical atherosclerosis. European heart journal, 33(18):2307–2316, 2012. [137] Peter Würtz, Aki S Havulinna, Pasi Soininen, Tuulia Tynkkynen, David Prieto-Merino, Therese Tillin, Anahita Ghorbani, Anna Artati, Qin Wang, Mika Tiainen, et al. Metabolite profiling and cardiovascular event risk: a prospective study of 3 population-based cohorts. Circulation, 131(9):774– 785, 2015. [138] Ari V Ahola-Olli, Linda Mustelin, Maria Kalimeri, Johannes Kettunen, Jari Jokelainen, Juha Auvinen, Katri Puukka, Aki S Havulinna, Terho Lehtimäki, Mika Kähönen, et al. Circulating metabolites and the risk of type 2 diabetes: a prospective study of 11,896 young adults from four Finnish cohorts. Diabetologia, 62:2298–2309, 2019. [139] MC Borges, AF Schmidt, B Jefferis, SG Wannamethee, DA Lawlor, M Kivimaki, et al. 
Circulating fatty acids and risk of coronary heart disease and stroke: individual participant data meta-analysis in up to 16 126 participants. Journal of the American Heart Association, 2020. [140] Emmi Tikkanen, Vilma Jägerroos, Michael V Holmes, Naveed Sattar, Mika Ala-Korpela, Pekka Jousilahti, Annamari Lundqvist, Markus Perola, Veikko Salomaa, and Peter Würtz. Metabolic biomarker discovery for risk of peripheral artery disease compared with coronary artery disease: lipoprotein and metabolite profiling of 31 657 individuals from 5 prospective cohorts. Journal of the American Heart Association, 10(23):e021995, 2021. [141] Michael V Holmes, Iona Y Millwood, Christiana Kartsonaki, Michael R Hill, Derrick A Bennett, Ruth Boxall, Yu Guo, Xin Xu, Zheng Bian, Ruying Hu, et al. Lipids, lipoproteins, and metabolites and risk of myocardial infarction and stroke. Journal of The American college of cardiology, 71(6):620–632, 2018. 97 References [142] Krista Fischer, Johannes Kettunen, Peter Würtz, Toomas Haller, Aki S Havulinna, Antti J Kangas, Pasi Soininen, Tonu Esko, Mari-Liis Tammesoo, Reedik Mägi, et al. Biomarker profiling by nuclear magnetic resonance spectroscopy for the prediction of all-cause mortality: an observational study of 17,345 persons. PLoS Medicine, 11(2):e1001606, 2014. [143] Scott C Ritchie, Peter Würtz, Artika P Nath, Gad Abraham, Aki S Havulinna, Liam G Fearnley, Antti-Pekka Sarin, Antti J Kangas, Pasi Soininen, Kristiina Aalto, et al. The biomarker GlycA is associated with chronic inflammation and predicts long-term risk of severe infection. Cell Systems, 1(4):293–301, 2015. [144] Johannes Kettunen, Scott C Ritchie, Olga Anufrieva, Leo-Pekka Lyytikäinen, Jussi Hernesniemi, Pekka J Karhunen, Pekka Kuukasjärvi, Jari Laurikka, Mika Kähönen, Terho Lehtimäki, et al. Biomarker glycoprotein acetyls is associated with the risk of a wide spectrum of incident diseases and stratifies mortality risk in angiography patients. 
Circulation: Genomic and Precision Medicine, 11(11):e002234, 2018. [145] Juho Tynkkynen, Vincent Chouraki, Sven J van der Lee, Jussi Hernesniemi, Qiong Yang, Shuo Li, Alexa Beiser, Martin G Larson, Katri Sääksjärvi, Martin J Shipley, et al. Association of branched-chain amino acids and other circulating metabolites with risk of incident dementia and Alzheimer’s disease: a prospective study in eight cohorts. Alzheimer’s & Dementia, 14(6):723–733, 2018. [146] Sven J van der Lee, Charlotte E Teunissen, René Pool, Martin J Shipley, Alexander Teumer, Vincent Chouraki, Debora Melo van Lent, Juho Tynkkynen, Krista Fischer, Jussi Hernesniemi, et al. Circulating metabolites and general cognitive ability and dementia: Evidence from 11 cohort studies. Alzheimer’s & Dementia, 14(6):707–722, 2018. [147] Lucie Lécuyer, Agnès Victor Bala, Mélanie Deschasaux, Nadia Bouchemal, Mohamed Nawfal Triba, Marie-Paule Vasson, Adrien Rossary, Aicha Demidem, Pilar Galan, Serge Hercberg, et al. NMR metabolomic signatures reveal predictive plasma metabolites associated with long-term risk of developing breast cancer. International Journal of Epidemiology, 47(2):484–494, 2018. [148] Päivi Sirniö, Juha P Väyrynen, Kai Klintrup, Jyrki Mäkelä, Markus J Mäkinen, Tuomo J Karttunen, and Anne Tuomisto. Decreased serum apolipoprotein A1 levels are associated with poor survival and systemic inflammatory response in colorectal cancer. Scientific Reports, 7(1):5374, 2017. [149] Jesse Fest, Lisanne S Vijfhuizen, Jelle J Goeman, Olga Veth, Anni Joensuu, Markus Perola, Satu Männistö, Eivind Ness-Jensen, Kristian Hveem, Toomas Haller, et al. Search for early pancreatic cancer blood biomarkers in five European prospective population biobanks using metabolomics. Endocrinology, 160(7):1731–1742, 2019. [150] Jared R Mayers, Chen Wu, Clary B Clish, Peter Kraft, Margaret E Torrence, Brian P Fiske, Chen Yuan, Ying Bao, Mary K Townsend, Shelley S Tworoger, et al. 
Elevation of circulating branched-chain amino acids is an early event in human pancreatic adenocarcinoma development. Nature Medicine, 20(10):1193–1198, 2014. [151] Thore Buergel, Jakob Steinfeldt, Greg Ruyoga, Maik Pietzner, Daniele Bizzarri, Dina Vojinovic, Julius Upmeier zu Belzen, Lukas Loock, Paul Kittner, Lara Christmann, et al. Metabolomic profiles predict individual multidisease outcomes. Nature Medicine, 28(11):2309–2320, 2022. 98 References [152] Fiona Bragg, Eirini Trichia, Diego Aguilar-Ramirez, Jelena Bešević, Sarah Lewington, and Jonathan Emberson. Predictive value of circulating NMR metabolic biomarkers for type 2 diabetes risk in the UK Biobank study. BMC Medicine, 20(1):159, 2022. [153] Xinyu Zhang, Wenyi Hu, Yueye Wang, Wei Wang, Huan Liao, Xiayin Zhang, Katerina V Kiburg, Xianwen Shang, Gabriella Bulloch, Yu Huang, et al. Plasma metabolomic profiles of dementia: a prospective study of 110,655 participants in the UK Biobank. BMC Medicine, 20(1):252, 2022. [154] Yi-Xuan Qiang, Jia You, Xiao-Yu He, Yu Guo, Yue-Ting Deng, Pei-Yang Gao, Xin-Rui Wu, Jian-Feng Feng, Wei Cheng, and Jin-Tai Yu. Plasma metabolic profiles predict future dementia and dementia subtypes: a prospective analysis of 274,160 participants. Alzheimer’s Research & Therapy, 16(1):16, 2024. [155] Rafael R Oexner, Hyunchan Ahn, Konstantinos Theofilatos, Ravi A Shah, Robin Schmitt, Philip Chowienczyk, Anna Zoccarato, and Ajay M Shah. Serum metabolomics improves risk stratification for incident heart failure. European Journal of Heart Failure, 26(4):829–840, 2024. [156] Zhening Liu, Hangkai Huang, Jiarong Xie, Yingying Xu, and Chengfu Xu. Circulating fatty acids and risk of hepatocellular carcinoma and chronic liver disease mortality in the UK Biobank. Nature Communications, 15(1):3707, 2024. [157] Wenyi Hu, Wei Wang, Huan Liao, Gabriella Bulloch, Xiayin Zhang, Xianwen Shang, Yu Huang, Yijun Hu, Honghua Yu, Xiaohong Yang, et al. 
Publication I

Heli Julkunen, Anna Cichońska, Prson Gautam, Sandor Szedmak, Jane Douat, Tapio Pahikkala, Tero Aittokallio, Juho Rousu. Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nature Communications, December 2020.

© 2020 The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

ARTICLE https://doi.org/10.1038/s41467-020-19950-z OPEN

Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects

Heli Julkunen 1, Anna Cichonska 1,2,3, Prson Gautam 3, Sandor Szedmak 1, Jane Douat 1, Tapio Pahikkala 2, Tero Aittokallio 1,3,4,5,6 ✉ & Juho Rousu 1 ✉

We present comboFM, a machine learning framework for predicting the responses of drug combinations in pre-clinical studies, such as those based on cell lines or patient-derived cells. comboFM models the cell context-specific drug interactions through higher-order tensors, and efficiently learns latent factors of the tensor using powerful factorization machines. The approach enables comboFM to leverage information from previous experiments performed on similar drugs and cells when predicting responses of new combinations in so far untested cells; thereby, it achieves highly accurate predictions despite sparsely populated data tensors.
We demonstrate high predictive performance of comboFM in various prediction scenarios using data from cancer cell line pharmacogenomic screens. Subsequent experimental validation of a set of previously untested drug combinations further supports the practical and robust applicability of comboFM. For instance, we confirm a novel synergy between anaplastic lymphoma kinase (ALK) inhibitor crizotinib and proteasome inhibitor bortezomib in lymphoma cells. Overall, our results demonstrate that comboFM provides an effective means for systematic pre-screening of drug combinations to support precision oncology applications.

1 Department of Computer Science, Helsinki Institute for Information Technology HIIT, Aalto University, Espoo, Finland. 2 Department of Future Technologies, University of Turku, Turku, Finland. 3 Institute for Molecular Medicine Finland FIMM, University of Helsinki, Helsinki, Finland. 4 Department of Mathematics and Statistics, University of Turku, Turku, Finland. 5 Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital, Oslo, Norway. 6 Oslo Centre for Biostatistics and Epidemiology, University of Oslo, Oslo, Norway. ✉email: tero.aittokallio@helsinki.fi; juho.rousu@aalto.fi

NATURE COMMUNICATIONS | (2020)11:6136 | https://doi.org/10.1038/s41467-020-19950-z | www.nature.com/naturecommunications

Combination therapies are often required for treating cancer patients with advanced stages of the disease. In addition to overcoming monotherapy resistance, combinatorial treatments can also reduce toxicity of the treatment (by reduced doses of the drugs) and improve therapeutic efficacy (by multi-targeting effect)1–3. With recent advances in high-throughput screening methods, a systematic evaluation of combinations among large collections of chemical compounds has become feasible.
This typically leads to large-scale experiments, in which the combinatorial responses are tested at various doses of the individual compounds, resulting in dose–response matrices that capture the measured combination effects for every concentration pair in a particular sample (e.g., cancer cell line or patient-derived cells)4. However, even with modern high-throughput instruments, experimental screening of drug combinations quickly becomes impractical, as the number of conceivable drug combinations increases rapidly with the number of drugs under consideration. In addition, the inherent heterogeneity of cancer cells poses further challenges for the experimental efforts, as the combinations need to be tested in various cell contexts and genomic backgrounds5,6. Therefore, computational methods are often used to guide the discovery of effective combinations to be prioritized for further pre-clinical and clinical validation7,8.

In recent years, machine learning has emerged as a powerful approach to aid the drug development process by offering systematic means for the prediction of target bioactivities and drug-induced effects9–13, thereby providing guidance for drug discovery and repositioning efforts14,15. Until recently, the performance of machine learning methods in predicting drug combination effects was limited by the lack of high-quality training data8. However, this is gradually changing as increasing amounts of data from pre-clinical drug combination screens are becoming available, creating new opportunities also for the application of large-scale machine learning methods4,5,16. For instance, the NCI-ALMANAC dataset generated by the US National Cancer Institute (NCI) provides over 3 million experimentally measured drug combination responses across various cell lines and tissue types4.
However, despite the potential value of such datasets, the high dimensionality of the underlying dose–response data and the inherent complexity of drug interaction patterns across various doses pose challenges to accurate modeling of drug combination effects.

Several computational tools have been proposed for the prediction of drug combinations2,7,8,17. Many of these tools have been systematically benchmarked in two crowdsourced DREAM Challenge competitions18,19, which demonstrated that computational predictions can achieve high accuracies for selected drug classes, provided that sufficient drug information and training data are available. However, the focus of these challenges and most of the previously proposed methods has been on directly predicting drug combination synergies (i.e., whether the combined summary effect is higher than expected). In many practical applications, however, more detailed information on dose–response effects of the combinations is required, rather than simply classifying the summary effects into synergistic or antagonistic classes. Furthermore, as noted in the recent AstraZeneca-Sanger drug combination prediction DREAM challenge19, the performance of the computational methods typically relies on selective incorporation of target features and biological knowledge that is not always available for all drugs and cell models. Therefore, there is a need to develop integrative and robust models capable of generalizing and learning from large amounts of available data that facilitate the exploration of the extensive combinatorial drug and dose spaces.

Here, we present comboFM, a novel machine learning framework for systematic modeling of drug-dose combination effects in a cell context-specific manner. It is generally applicable to any pre-clinical model systems, such as patient-derived primary cells, but we demonstrate its performance here in cancer cell lines (Fig. 1).
We base our work on the observation that the drug combination dose–response data can be compiled into a higher-order tensor indexed by drugs, drug concentrations, and cell lines. comboFM then models the cell line-specific responses to a combination of drugs as an interaction between the different modes of the tensor using a higher-order factorization machine (FM)20, a recently proposed machine learning approach for nonlinear learning on large data. FMs have been shown to be compelling tools with the ability to work particularly well with high-dimensional and sparse datasets20–22. In contrast to existing machine learning models, comboFM enables one to explore the detailed landscape of drug combination responses across various doses. We demonstrate that comboFM obtains high prediction accuracy in various practical application scenarios, significantly outperforming other approaches. Furthermore, we show the robustness and practical potential of comboFM by experimentally validating untested drug combinations predicted for specific cell lines.

Results

Overview of comboFM model. comboFM was developed for predicting drug combination responses of cancer cell lines in three practical scenarios (Fig. 1a). The first scenario of predicting new dose–response matrix entries corresponds to filling in the gaps in partially measured dose–response matrices. In the second scenario of new dose–response matrix inference, the predictions are made for completely held out dose–response matrices of untested drug–drug–cell line triplets, such that the drug pair has still been observed in other cell lines. In the third and most challenging scenario of new drug combination inference, the predictions are made for completely new drug combinations with no available combination measurements in any cell line, thereby providing guidance on repositioning of the drugs for new combinations and cell contexts.
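The three scenarios differ mainly in which unit is held out together. The following is a minimal sketch of that idea, not the authors' cross-validation procedure: it assumes each measurement is a tuple `(drug1, drug2, cell, conc1, conc2)`, and the function name `assign_folds` and the scenario labels are hypothetical. Note that random grouping alone does not guarantee the second scenario's requirement that the held-out drug pair is observed in other cell lines; a real split would need to check this.

```python
import random

def assign_folds(records, scenario, n_folds=10, seed=0):
    """Assign a fold index to each measurement (illustrative sketch).

    records  : list of (drug1, drug2, cell, conc1, conc2) tuples
    scenario : "entries"      -> single matrix entries are held out
               "matrices"     -> whole (drug pair, cell line) matrices are held out
               "combinations" -> whole drug pairs are held out across all cell lines
    """
    keys = []
    for idx, (d1, d2, cell, _c1, _c2) in enumerate(records):
        pair = tuple(sorted((d1, d2)))  # treat (A, B) and (B, A) as the same pair
        if scenario == "entries":
            keys.append(idx)
        elif scenario == "matrices":
            keys.append((pair, cell))
        elif scenario == "combinations":
            keys.append(pair)
        else:
            raise ValueError(f"unknown scenario: {scenario}")

    # Shuffle the distinct groups, then deal them out to folds round-robin,
    # so that all measurements sharing a group key land in the same fold.
    groups = sorted(set(keys))
    random.Random(seed).shuffle(groups)
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[k] for k in keys]
```

Grouping by key before assigning folds is what prevents information about a held-out matrix or drug pair from leaking into the training folds.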
To capture the high-order interactions between drug combinations in different cell lines and at various doses, comboFM models the multi-way interactions between the two drugs, the cell lines and the dose–response matrices as a fifth-order data tensor X (Fig. 1b). Furthermore, comboFM makes it possible to integrate any auxiliary data of the drugs and cell lines, such as chemical descriptors in the form of molecular fingerprints of drug compounds, gene expression profiles of the cancer cell lines and concentration values tested for the drugs. For the learning algorithm, the data tensor X is flattened into a two-dimensional array (Fig. 1c), where each row vector x identifies a single entry in the original tensor. Given the associated responses $y_i$ in the training data, the comboFM model is learned using factorization machines (FMs). Higher-order FMs learn a non-linear regression model from the input features (x) to the output (y) by estimating a regression weight $w_{i_1,\ldots,i_t}$ for each combination of input features $x_{i_1} x_{i_2} \cdots x_{i_t}$, where t is the order of the interaction. However, instead of estimating the weights $w_{i_1,\ldots,i_t}$ separately as in polynomial regression, FMs approximate the weights using factorized parametrization (Fig. 1d), where the weights are coupled through multiplication of latent factors learned by the FM. This approach avoids the computational and statistical problems that would result from directly estimating the weight tensor W. In addition, the coupling of the weights allows effective learning in situations where the data tensor is sparsely populated.
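The flattening and the factorized weights described above can be illustrated with a small, deliberately naive sketch. It is not the comboFM implementation: the helper names (`build_index`, `encode`, `fm_predict`) are hypothetical, auxiliary descriptors are omitted, and the use of a separate factor matrix `P[t]` for each interaction order is an assumption made for this illustration. The weight of a feature combination is computed as a sum over the k latent dimensions of the product of the corresponding factor entries.

```python
from itertools import combinations

def build_index(values):
    """Map each distinct value of one tensor mode to a column position."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode(entry, indexers):
    """One-hot encode one tensor entry into a sparse row {column: 1.0}.

    entry    : one value per tensor mode, e.g. (drug1, drug2, cell, conc1, conc2)
    indexers : one index dict per mode; the blocks are concatenated column-wise
    """
    row, offset = {}, 0
    for value, index in zip(entry, indexers):
        row[offset + index[value]] = 1.0
        offset += len(index)
    return row

def fm_predict(row, w0, w, P, order):
    """Higher-order FM prediction for a sparse row (naive enumeration).

    w0, w : bias and first-order weights
    P     : dict mapping order t (2..order) to a d x k factor matrix (assumed layout)
    """
    y = w0 + sum(w[i] * x for i, x in row.items())
    active = sorted(row)  # only non-zero features contribute to interactions
    for t in range(2, order + 1):
        k = len(P[t][0])
        for combo in combinations(active, t):
            x_prod = 1.0
            for i in combo:
                x_prod *= row[i]
            # Factorized weight: sum over s of the product of P[t][i][s] over the combo
            weight = 0.0
            for s in range(k):
                term = 1.0
                for i in combo:
                    term *= P[t][i][s]
                weight += term
            y += weight * x_prod
    return y
```

With a purely one-hot row of five active features, only 26 feature combinations of order two to five are non-zero, which is why sparse inputs keep the interaction terms cheap to evaluate. Practical implementations avoid the explicit enumeration used here.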
Fig. 1 Overview of the comboFM framework for the prediction of drug-dose combination effects. a Three prediction scenarios are considered: filling in missing entries in partially tested dose–response matrices, predicting a complete dose–response matrix in a new cell line, and making predictions for a completely new drug combination not tested so far in any cell line. b In each prediction scenario, the experimentally measured dose–response matrices are compiled into a fifth-order tensor X indexed by drugs (D1, D2), drug concentrations ([D1], [D2]) and cell lines (C), and genomic and chemical descriptors are integrated into the prediction model. c The structure of the tensor underlying the drug combination dose–response matrix data is one-hot encoded into a single feature matrix together with the additional chemical and genomic descriptors.
d The model parameters $w_{i_1,i_2,\ldots,i_t}$ for a tth order combination (t = 3 depicted) of features $i_1, \ldots, i_t$ are approximated using factorized parametrization $w_{i_1,i_2,\ldots,i_t} \approx \sum_{s=1}^{k} p_{i_1 s} p_{i_2 s} \cdots p_{i_t s}$ (see “Methods”). d denotes the total number of features and k is a hyperparameter defining the rank of the factorization.

Accurate drug combination response predictions by comboFM. To systematically evaluate the comboFM model, we used the anticancer drug combination response data from the NCI-ALMANAC study4. To enable various splits of data into different cross-validation folds as required by the different prediction scenarios and to keep the computational complexity manageable, we considered a subset of the data consisting of 50 unique FDA-approved drugs (Supplementary Table 3) in 617 distinct combinations screened in various concentration pairs across all the 60 cell lines originating from 9 tissue types23. In this data subset, a total of 333,180 drug combination response measurements and 222,120 monotherapy response measurements of single drugs are available in the form of percentage growth of the cell lines (see “Methods”). To computationally quantify the performance of comboFM in predicting drug combination responses and optimize the model parameters, we performed a 10 × 5 (10 outer folds, 5 inner folds) nested cross-validation (CV) procedure under the three prediction scenarios (see “Methods”). The order of the feature interactions modeled by the FM was set to m = 5, according to the order of the underlying tensor. To investigate the benefit of considering higher-order feature interactions, we also performed experiments using both second order formulation of FMs and first order FMs (corresponding to ridge regression).
To further benchmark the predictive performance of comboFM, we applied random forest (RF) as a reference model, a widely used machine learning model that is based on a rather different learning principle, and has previously been used for modeling drug combination effects24–28, including the winning method of the recent AstraZeneca-Sanger drug combination prediction DREAM Challenge19. The cross-validation folds were held fixed throughout the experiments to ensure a fair comparison. We assessed the predictive performance of the methods using root mean squared error (RMSE), as well as Pearson and Spearman correlation between original and predicted dose–response matrices.

By leveraging the multi-way interactions present in the underlying high-dimensional drug combination space across drugs, drug concentrations, and cancer cell lines, the 5th order comboFM demonstrated high predictive accuracy in all the three prediction scenarios (Fig. 2), outperforming the random forest reference (p < 10−10 in all prediction scenarios, two-sided Wilcoxon paired signed rank sum test, N = 666,360). In the scenarios of predicting new dose–response matrix entries and new dose–response matrices, the 5th order comboFM obtained a Pearson correlation of 0.97, and even in the new drug combination prediction scenario, the 5th order comboFM obtained a Pearson correlation of 0.95. The 5th order comboFM was also markedly more accurate than both the 1st- and 2nd order comboFMs in all the three scenarios. Similar relative performance of the methods was also observed using Spearman correlation and RMSE (Fig. 2). In addition, the distribution of the predictions by 5th order comboFM followed that of the measured responses most accurately (Supplementary Fig. 1).

In addition to the global predictive performance of the methods, we also analyzed their performance in different tissue types and across the various types of drug combination therapies (Fig. 3 and Supplementary Figs.
2–4 and Supplementary Table 1). In all the three prediction scenarios (Fig. 3a–c), comboFM showed the highest average prediction accuracy in each of the tissue types, and also the smallest variance across the tissue types. The combination response in colon cancer appeared marginally more difficult to predict than the other tissue types, which is likely explained by higher variation in the colon cancer response data, as the number of colon cancer cell lines was similar to the other tissue types and thus the marginally inferior performance is unlikely to stem from limited data quantity. Nevertheless, the 5th order comboFM was still the most accurate method also in colon cancer cell lines. Furthermore, comboFM was shown to provide high accuracies across various types of combination therapies (chemotherapies, targeted therapies, and other therapies, such as hormonal therapies) (Fig. 3d–f). The combination therapies involving drugs from the Other class include the smallest number of observations, explaining their reduced predictive accuracy with all the methods.

To further validate the performance of the 5th order comboFM, we also evaluated its predictive accuracy in the remaining part of the NCI-ALMANAC data that was not used in the cross-validation, consisting of 4737 distinct drug combinations. The model was trained on the full development dataset of 617 drug combinations as well as the monotherapy responses of the single drugs in the validation set, and the trained model was then used for predicting responses of the 4737 drug combinations in the validation set across the various cell lines. The 5th order comboFM demonstrated high predictive accuracy also in this validation set (Supplementary Figs. 5 and 6), with Pearson correlation of 0.91 even for combinations where neither drug had previously been observed in any other combination, i.e. only the monotherapy responses of the individual drugs in the combination were available to the model.
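The three accuracy measures reported in this section (RMSE, Pearson correlation and Spearman correlation between measured and predicted responses) can be written out in a few lines. The sketch below uses only the standard library and hypothetical function names; it illustrates the definitions and is not the evaluation code used in the study. Spearman correlation is computed as the Pearson correlation of average ranks, which handles ties.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between measured and predicted responses."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def pearson(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

RMSE is expressed in the units of the response (here %-growth), whereas the two correlations are scale-free, which is one reason to report them side by side.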
Synergy scores can be recovered with high accuracy based on the predicted dose–response matrices. As the interest in drug combination experiments often lies in discovering the most synergistic drug combinations, we also quantified drug combination synergies based on the dose–response matrices predicted with comboFM. As a synergy quantification model, we applied NCI ComboScore (see “Methods”)4, computed over the complete predicted dose–response matrices. Although drug combinations with an NCI ComboScore above zero are technically defined to be synergistic, combinations with highly synergistic effects are typically considered more attractive candidates for further experimental validation. Therefore, we labeled the extreme synergistic drug combinations (observed NCI ComboScore value in the top 10%) as the positive class and the remaining combinations, including lowly synergistic, additive, and antagonistic combinations, as the negative class. Drug combination synergy scores were recovered with high accuracy from the dose–response matrices predicted by the 5th order comboFM in all three prediction scenarios, significantly outperforming the other compared methods (Supplementary Fig. 7). Importantly, the drug combination synergies could be accurately computed based on the predicted dose–response matrices using the 5th order comboFM even in the challenging scenario of predicting new drug combinations, with a Pearson correlation of 0.72 (p < 10^−10, two-sided t-test, N = 74,040) between the observed and predicted NCI ComboScores. In the task of discriminating highly synergistic drug combinations, the 5th order comboFM obtained a high area under the receiver operating characteristic curve (AUC) of 0.91 in the new drug combination prediction task (Supplementary Fig. 8). The discrimination accuracy was high in each prediction scenario, also when using various top-% thresholds to define the extreme synergy combinations (Supplementary Fig. 8).
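The classification setup described above (top 10% of observed synergy scores as the positive class, discrimination measured by AUC) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy scores are hypothetical, and the AUC is computed through its rank interpretation rather than with whatever tooling the study used.

```python
def label_top_fraction(observed_scores, fraction=0.10):
    """Label the highest-scoring `fraction` of combinations as positive (1).

    Scores tied with the threshold are also labeled positive.
    """
    n_pos = max(1, round(fraction * len(observed_scores)))
    threshold = sorted(observed_scores, reverse=True)[n_pos - 1]
    return [1 if s >= threshold else 0 for s in observed_scores]

def roc_auc(labels, predicted_scores):
    """Area under the ROC curve, computed as the probability that a random
    positive receives a higher predicted score than a random negative
    (ties count one half)."""
    pos = [s for l, s in zip(labels, predicted_scores) if l == 1]
    neg = [s for l, s in zip(labels, predicted_scores) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical observed synergy scores for 20 combinations, and the scores
# computed from the corresponding predicted dose-response matrices.
observed = [120, 95, 60, 40, 35, 30, 22, 18, 15, 10,
            5, 0, -5, -12, -20, -30, -45, -60, -80, -100]
predicted = [90, 40, 70, 30, 25, 45, 10, 12, 8, 6,
             2, -3, 1, -10, -15, -25, -30, -50, -60, -90]

labels = label_top_fraction(observed, fraction=0.10)
auc = roc_auc(labels, predicted)
print("positives:", sum(labels), "AUC:", auc)
```

Note that the labels are derived from the observed scores only, while the ranking being evaluated comes from the predicted scores, mirroring the separation described in the text.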
[Figure 2: scatter plots of predicted response (%-growth) against measured response (%-growth) for comboFM-5, comboFM-2, comboFM-1 and RF. Panels: a Predicting new dose–response matrix entries; b Predicting new dose–response matrices; c Predicting new drug combinations.]

Fig. 2 Predictive performance of 5th (comboFM-5), 2nd (comboFM-2) and 1st order comboFM (comboFM-1), and random forest (RF) as scatter plots between the measured and predicted dose–response matrices. The responses were measured by percentage growth in the three prediction scenarios: a new dose–response matrix entries, b new dose–response matrices, and c new drug combinations. Root mean squared error (RMSE), Pearson correlation (RPearson) and Spearman correlation (RSpearman) for the drug combination response prediction are reported as averages over 10 outer CV folds. The Pearson correlation of the NCI ComboScores is reported as an average over all computed NCI ComboScores, computed based on the predicted dose–response matrices. Trend line and its equation are shown for each scatter plot.

NATURE COMMUNICATIONS | (2020)11:6136 | https://doi.org/10.1038/s41467-020-19950-z | www.nature.com/naturecommunications

Experimental validation of the most synergistic predicted drug combinations. To further demonstrate the ability of comboFM to predict novel and robust drug combinations, the model was trained using all the available dose–response measurements in the development dataset, and the trained comboFM was then used to predict dose–response matrices for remaining
unmeasured drug combinations across all the 60 cell lines, which resulted in a total of 10,320 predicted complete dose–response matrices. Experimental validation was subsequently performed on a subset of 16 drug combinations specific for 4 cell lines (Supplementary Table 2), where high synergy was predicted by comboFM. These combinations were selected to mainly involve molecularly targeted therapies, as recent interest has increasingly shifted toward targeted agents over the standard cytotoxic chemotherapies. In particular, we focused on cancer-specific drug combinations which were predicted to have highly synergistic effects only in a subset of all the cell lines and tissue types. This poses a more challenging task than identifying broadly toxic combinations that kill most cancer cells, but which may also induce severe toxicities in healthy cells. As in the previous experiments, we considered as highly synergistic those combinations with observed NCI ComboScore values in the top 10% in a particular tissue type. The results of the experimental validation of 16 drug–drug–cell line triplets are summarised in Fig. 4, using the Bliss model to quantify the observed synergy. The background histogram shows the distribution of an in-house drug combination dataset, consisting of 60 drug combinations tested against 16 KRAS-mutant pancreatic ductal adenocarcinoma cell lines. Since the combinations in the reference set were not randomly selected, the background synergy distribution shows a slight positive bias; however, since the assay was the same as the one used for the experimental validation of comboFM predictions (“Methods”), it is expected to provide a valid reference distribution for statistical evaluations. All the drug combinations predicted by comboFM were validated as synergistic, when considering a positive Bliss score as evidence for a degree of synergy (p < 10^−4, binomial test against the background distribution).
Importantly, 9 out of 16 combinations had a Bliss synergy score higher than 90% of the background distribution (p < 10^−5, binomial test). In addition to Bliss synergy score, we also computed the synergy scores using three other popular synergy models: Loewe, highest single agent (HSA) and zero-interaction potency (ZIP) scores (Supplementary Figs. 9 and 10). These results demonstrate the robustness of the comboFM predictions across various experimental setups and synergy scoring models.

[Figure 3: boxplots of Pearson correlation for comboFM-5, comboFM-2, comboFM-1 and RF. Panels a–c are grouped by tissue type (Breast, CNS, Colon, Haematological, Melanoma, NSC Lung, Ovarian, Prostate, Renal); panels d–f are grouped by drug classes (Chemotherapy - Chemotherapy, Chemotherapy - Other, Other - Other, Targeted - Chemotherapy, Targeted - Other, Targeted - Targeted). Panel titles: a, d Predicting new dose–response matrix entries; b, e Predicting new dose–response matrices; c, f Predicting new drug combinations.]

Fig. 3 Predictive performance of 5th (comboFM-5), 2nd (comboFM-2) and 1st order comboFM (comboFM-1), and random forest (RF) across tissue types and drug classes in the three prediction scenarios. a–c tissue types. d–f drug classes. The three prediction scenarios are depicted as follows: a, d predicting new dose–response matrix entries, b, e predicting new dose–response matrices, and c, f predicting new drug combinations. Further information on the drug classes can be found in Supplementary Table 3. In the boxplots, the horizontal lines drawn in the middle denote the median, and the lower and upper hinges correspond to the 25th and 75th percentiles, respectively. The upper and lower whiskers denote the largest and smallest values, respectively, no further than 1.5 times the inter-quartile range (IQR). The points that are not included between the whiskers are outlier predictions.

[Figure 4: Bliss synergy scores of the experimentally tested combinations selected based on comboFM predictions, drawn above a histogram of the number of combinations in the reference dataset, with dashed lines at the 50%, 75%, 90% and 99% percentiles. Tested drug–drug–cell line triplets: Everolimus - Romidepsin - MALME-3M; Erlotinib - Axitinib - IGROV1; Oxaliplatin - Romidepsin - MALME-3M; Romidepsin - Everolimus - HS578T; Romidepsin - Vismodegib - MALME-3M; Teniposide - Lenalidomide - SR; Gefitinib - Vismodegib - MALME-3M; Lomustine - Gefitinib - IGROV1; Lomustine - Gefitinib - SR; Dactinomycin - Romidepsin - MALME-3M; Fulvestrant - Crizotinib - SR; Vandetanib - Exemestane - IGROV1; Cladribine - Romidepsin - MALME-3M; Crizotinib - Bortezomib - SR; Thioguanine - Megestrol acetate - SR; Vismodegib - Romidepsin - HS578T.]

Fig. 4 Measured drug combination synergy scores in the experimental validation.
The in-house experimental validations of the 16 selected predictions in specific cell lines are shown as colored lines (on top), and the histogram shows a background distribution from an in-house reference dataset that comprises 60 drug combinations tested against 16 KRAS-mutant pancreatic ductal adenocarcinoma cell lines (see “Methods”). The synergy was quantified using the Bliss independence score for the most synergistic area of the dose–response matrix (see Supplementary Fig. 9 for other synergy scores). The color scale corresponds to the Bliss scores (green: antagonistic response, white: independent response, red: synergistic response). Dashed lines denote the percentiles of the background distribution obtained using the same experimental setup.

Among others, comboFM predicted a particularly high level of synergy for the combination between the anaplastic lymphoma kinase (ALK) inhibitor crizotinib and the proteasome inhibitor bortezomib in the lymphoma cell line SR. In addition to our in-house experimental validations, this finding was further validated in external measurements in the NCI-ALMANAC data that were not used as part of the comboFM training data. ALK inhibitors are effective against cancers harboring ALK fusions. The SR cell line carries the NPM1-ALK fusion, which was the first ALK fusion ever discovered in large-cell lymphoma29. Bortezomib is approved for mantle cell lymphoma, supporting its potential in lymphoma treatment. It is likely that two even mildly effective inhibitors, when used in combination, may enhance the inhibition effect and potentially overcome monotherapy resistance. Notably, comboFM made this prediction without knowledge of the ALK fusion status of the SR cell line, i.e., this biological rationale was not available to the model. The prediction of high synergy between the first-generation inhibitors of ALK and proteasome for lymphoma cell lines highlights the potential of comboFM to predict biologically plausible combination effects.
The comboFM model also identified another unique drug combination effective against the SR cell line, the combination of the EGFR inhibitor gefitinib with lomustine, a chemotherapy approved for lymphoma treatment. One of the mechanisms inducing resistance to ALK inhibitors is activation of EGFR, as they signal through similar downstream pathways. Brigatinib, a dual ALK/EGFR inhibitor, is therefore being explored in clinical settings against lymphoma and lung cancer patients (NCT01449461). Our comboFM method predicted combination partners to extensively explored ALK and EGFR inhibitors for lymphoma, which we were able to also validate in the experimental setting (Fig. 4). These examples show the potential of comboFM to identify novel combinations of both targeted and cytotoxic treatments that individually are already used as lymphoma treatments, and therefore are likely to have acceptable toxicity profiles in clinical applications.

Discussion
Given the enormous number of conceivable drug and dose combinations, computational approaches are needed to accelerate the experimental work by providing guidance toward identifying the most promising drug combinations for further experimental validation. While large datasets of drug combination dose–response matrices have already been tested in the lab, extensive gaps still remain in the combinatorial space among both targeted and non-targeted therapies, as well as hormonal and immunotherapies. Here, we have presented a novel machine learning framework, comboFM, for large-scale systematic prediction of drug combination effects in human cancer cell lines. The obtained results demonstrate that comboFM can leverage predictive higher-order relationships between drugs, drug concentrations, and cancer cell line responses, which were missed when using random forest and simpler approaches, including the 1st and 2nd order formulations of comboFM.
Importantly, comboFM can accurately generalize the predictions also to new drug combinations not observed in the training space, which enables one to systematically predict dose–response matrices also for so far untested drug combinations formed by the individual drugs in the training set. This will provide guidance on repositioning the drugs into new combinations. We also demonstrated that comboFM consistently obtains high prediction performance across various tissue types and classes of drug combination therapy. In addition, the 5th order comboFM was 3 times faster to train than the random forest reference when run on the same CPU and using a relatively conservative number of 200 training epochs for the comboFM model (Supplementary Table 2). Further performance advantages were obtained by employing a GPU for training the 5th order comboFM model (34 times faster compared to random forest). Modeling the drug combination effects first at the level of dose–response matrices and subsequently quantifying the level of overall drug combination synergy over the full matrix provides many benefits compared to approaches that directly aim at predicting the drug combination synergies. First of all, predicting the underlying dose–response matrices enables one to leverage all the information contained in the dose–response matrices and provides detailed information on the response landscape across various dose combinations. In addition, in the second stage, one is not limited to a single synergy quantification model, but can explore the synergies using various models, hence gaining a more comprehensive view of the synergistic drug combination landscapes30.
Furthermore, understanding the drug combination effects both at the dose level as well as at the synergy level provides useful guidance for precision medicine efforts. For instance, combination synergies observed at lower doses are often better tolerated in clinical practice. Furthermore, it has been shown that for most of the FDA-approved drug combinations, only little evidence of additivity or synergy was observed in pre-clinical models31, highlighting that synergy is not always needed for clinical treatment success. However, it has also been argued that patient stratification based on predictive markers is likely to reduce variability in clinical therapy responses, and contribute to achieving truly synergistic responses to combination treatments32. In-house experimental validations of the top-synergistic combinations predicted using the NCI-ALMANAC data demonstrated that the comboFM predictions are robust also to the experimental setup. The in-house assay had many experimental differences when compared to the combination assay used to profile the NCI-ALMANAC development dataset. In particular, the in-house assay measured the drug combination responses in the form of percentage inhibition, instead of the percentage growth that is used in the NCI-ALMANAC assay. Therefore, we could not calculate the NCI ComboScore for the experimental validations, but instead scored the combinations using four popular synergy models (Supplementary Figs. 9 and 10). As an example, comboFM predicted a pivotal role of histone deacetylase (HDAC) in the melanoma cell line MALME-3M, thereby suggesting potential of HDAC inhibition against melanoma. In particular, various combinations with the HDAC inhibitor romidepsin were predicted to be effective against the BRAF-mutant melanoma cell line MALME-3M, which also held true in the experimental settings (Fig. 4).
Even though most of the drugs in the romidepsin combinations have already been explored in different combinations to target melanoma33,34, the combinations predicted by comboFM have remained unexplored against melanoma, and warrant further investigation. Individually, each of these inhibitors has shown promising results in pre-clinical or clinical settings against melanoma, further supporting their use in combination therapies. Even though the main objective of this work was to develop and carefully validate the comboFM model in cancer cell lines as an accurate methodology for systematic prediction of drug combination responses for biological discovery, we note that many of the drugs identified by comboFM have been or are currently being explored in clinical settings against the specific cancer type, either as single agents or in combination with other drugs (see Supplementary Table 5). For instance, the HDAC inhibitor vorinostat is being tested against BRAF-mutant advanced melanoma in an ongoing clinical trial (ref. 35; NCT02836548). Similarly, the mTOR inhibitor everolimus has been shown to selectively target BRAF-mutant melanoma in acidic conditions36. In an ongoing clinical trial, the mTOR inhibitors everolimus or temsirolimus in combination with a BRAF inhibitor are being investigated against BRAF-mutant advanced solid tumors (NCT01596140). The SMO inhibitor vismodegib blocks the Hedgehog pathway, which regulates skin growth. In the case of medulloblastoma, HDAC inhibitors are active even against SMO-inhibitor resistant cell lines37. Hence, concurrent use of HDAC and SMO inhibitors is a promising strategy to target melanoma, as predicted for the romidepsin and vismodegib combination (Fig. 4). Following the same rationale, combining an HDAC inhibitor with DNA damaging agents, such as oxaliplatin, dactinomycin, and cladribine, holds strong promise and is being explored in different pre-clinical and clinical settings33,34,38,39.
These case examples already unveil the potential of our method for predicting combinations with translational potential, although these findings warrant further validation in proper clinical trials. Furthermore, once the model accuracy has been confirmed in the cell line resources, we envision that the carefully validated model will be applicable also to data from individual cancer patients, thereby providing means for tailoring effective combinations in precision oncology applications. For selected cancer types, such as haematological malignancies, molecular and drug response profiling data are becoming available from patient-derived primary cells that can be used for training cancer type-specific prediction models40,41. Once similar data from other cancer types become available, comboFM will also enable pan-cancer analyses, similar to the current analyses in the NCI-ALMANAC cell lines. We found that many of the combinations predicted in the NCI-ALMANAC cell lines have actually already been tested in clinical trials (Supplementary Table 5). Interestingly, most of the combinations are tested in different indications than what was predicted based on the cell lines, suggesting further drug repurposing opportunities. The comboFM predictions require input data that are starting to be routinely available in many functional precision medicine studies, making the method broadly applicable to many cancer types and therapy classes. In the present study, we assumed that one knows the monotherapy responses of single drugs prior to predicting the combination responses, as in practice one often needs to know the concentration ranges and potencies of the single drugs (i.e., dose–response curves) in order to know which dose combinations should be used in combination testing, and also how potent the compounds are individually.
comboFM strongly benefits from this information due to its capability to interpolate in the space of dose–response matrices through the computation of latent factors representing similarly behaving drug combinations from the response tensor alone (similarly to recommender systems grouping users by the movies they have liked in the past), while the drug and cell line descriptors merely fine-tune the predictions. It is plausible that by careful experimental design, one could minimize the number of monotherapy responses needed for accurate dose–response matrix prediction42 whilst maintaining the accuracy of the comboFM model, which we leave as an interesting future research topic. However, in a scenario where one would like to perform predictions for completely new molecules with no prior monotherapy or combination response data in any cell line, the computed latent factors are no longer helpful, and none of the methods could perform well with the current design (Supplementary Fig. 13). This limitation of the methodology in such scenarios could potentially be addressed by more extensive feature engineering or by developing models that are specialized for the case of predicting dose–response matrices for combinations of completely new drugs. As with any high-throughput pre-clinical data, the cell line drug response profiles may show inconsistency in experimental outputs across the same cell line-treatment pairs43. Therefore, we argue that it is important to develop and initially evaluate the prediction models in large enough and standardized cell line resources, such as NCI-ALMANAC, to avoid any reproducibility issues in the development phase. We further tested the model predictions using distinct experimental setups in the same cell lines to show that the predictions were robust enough against such biological and technical variability. 
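To make the latent-factor argument above concrete, the following minimal sketch evaluates a higher-order factorization machine by brute force, with each interaction weight given by the generalized inner product of per-feature factor vectors as defined in the Methods. The function names, the toy weights and the dictionary layout of the factor matrices are assumptions made for illustration; the study itself used an existing TensorFlow implementation, which is not reproduced here, and efficient implementations avoid the explicit enumeration used below.

```python
from itertools import combinations

def generalized_inner_product(vectors):
    """Sum over s of the product of the s-th components of all vectors."""
    k = len(vectors[0])
    total = 0.0
    for s in range(k):
        prod = 1.0
        for v in vectors:
            prod *= v[s]
        total += prod
    return total

def hofm_predict(x, w, factors):
    """Brute-force HOFM prediction for one feature vector.

    x       : feature vector of length d
    w       : linear weights of length d
    factors : dict mapping interaction order t to a d x k list of factor
              vectors, one k-dimensional vector per feature
    """
    y = sum(wi * xi for wi, xi in zip(w, x))
    # Features equal to zero contribute nothing to any interaction term.
    active = [i for i, xi in enumerate(x) if xi != 0.0]
    for t, P in factors.items():
        for idx in combinations(active, t):
            weight = generalized_inner_product([P[i] for i in idx])
            prod_x = 1.0
            for i in idx:
                prod_x *= x[i]
            y += weight * prod_x
    return y

# Hypothetical model with d = 3 features, rank k = 2 and orders 2 and 3.
w = [0.5, -1.0, 2.0]
factors = {
    2: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    3: [[1.0, 2.0], [1.0, 1.0], [2.0, 1.0]],
}

print(hofm_predict([1.0, 1.0, 1.0], w, factors))
print(hofm_predict([1.0, 0.0, 1.0], w, factors))
```

The weight of the pair (0, 2) is never stored explicitly: it is recovered from the two factor vectors, which is why a pair that never co-occurred in training can still be scored as long as each feature was observed elsewhere.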
In conclusion, given the high cost of the experimental screening of drug combinations, comboFM has the potential to provide time- and cost-effective means toward prioritizing the most promising drug combinations for further pre-clinical or clinical studies. The accurate and robust drug combination response predictions provide a promising approach to streamline the development and expansion of combination therapeutics in personalized cancer treatment. This could ultimately accelerate the clinical use of combination therapeutics to combat acquired drug resistance and to increase therapeutic efficacies.

Methods
Higher-order factorization machines. comboFM uses higher-order factorization machines (HOFM)20,21 for predicting the drug–drug combination responses. HOFMs are non-linear regression models learned with a training set of examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ of feature vectors $x \in \mathbb{R}^d$ and output labels $y \in \mathbb{R}$. A trained HOFM models the output $y \in \mathbb{R}$ as a function of single, pairwise, and higher-order interactions between input features up to order $m$:

$$\hat{y}(x) := \sum_{i=1}^{d} w_i x_i + \sum_{1 \le i < i' \le d} w_{i,i'} x_i x_{i'} + \ldots + \sum_{1 \le i_1 < \ldots < i_m \le d} w_{i_1, i_2, \ldots, i_m} x_{i_1} x_{i_2} \cdots x_{i_m}. \qquad (1)$$

The first term corresponds to a linear model, and all parameters $w_i$ are independently estimated. The higher-order parameters are, on the other hand, estimated in a factorized form

$$w_{i,i'} = \langle p_i^{(2)}, p_{i'}^{(2)} \rangle \qquad (2)$$

$$w_{i_1, i_2, \ldots, i_t} = \langle p_{i_1}^{(t)}, p_{i_2}^{(t)}, \ldots, p_{i_t}^{(t)} \rangle, \quad t = 3, \ldots, m \qquad (3)$$

where $p_i^{(m)} \in \mathbb{R}^k$ denotes the $m$th order factor weight of feature $i$, $k$ is the hyperparameter defining the rank of the factorization, and

$$\langle a_1, a_2, \ldots, a_m \rangle = \sum_{s=1}^{k} a_{1s} a_{2s} \cdots a_{ms} \qquad (4)$$

denotes a generalized inner product of $m$ vectors $a_i \in \mathbb{R}^k$, $i = 1, \ldots, m$, that generalizes the usual pairwise inner product $\langle a, b \rangle = a^T b$ to sets of $m$ vectors. The factor weights are collected into matrices $P^{(m)} = (p_1^{(m)}, \ldots, p_d^{(m)})^T \in \mathbb{R}^{d \times k}$. The factorized parametrization drastically reduces the number of estimated parameters from $O(d^m)$ (all feature combinations have their own parameter) to $O(kdm)$ ($m - 1$ factor matrices of dimension $d \times k$). In principle, HOFMs allow a unique rank $k_t$ for each order $t = 2, \ldots, m$. In the above description and in our experiments, we used a uniform rank $k = k_2 = \ldots = k_m$. FMs are based on the assumption that the effect of pairwise and higher-order feature interactions has a low rank, which allows FMs to estimate reliable parameters even under highly sparse data. Hence, the co-occurrence of $x_i$ and $x_{i'}$ does not need to be observed in order to learn $w_{i,i'}$: the factors $p_i$ and $p_{i'}$ can be learned by interacting with other dimensions and the dot product of $p_i$ and $p_{i'}$ still gives $w_{i,i'}$. This is extremely useful in the case of high-dimensional drug combination data where the input tensor is typically very sparse, and thus allows one to make reliable inferences of the responses to new drug combinations whose individual components have still been observed in other combinations elsewhere in the training tensor. Compared to standard matrix factorization approaches, FMs provide additional flexibility by allowing integration of auxiliary data describing the drugs and cell lines, such as chemical and genomic descriptors. The objective function of learning higher-order factorization machines is to minimize the regularized mean squared error

$$\min \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}(x_i) \right)^2 + \frac{\beta_1}{2} \lVert w \rVert^2 + \sum_{t=2}^{m} \frac{\beta_t}{2} \lVert P^{(t)} \rVert^2 \qquad (5)$$

where $\beta_1, \ldots, \beta_m > 0$ are regularization parameters. To limit the number of hyperparameter combinations to search, following the work by Blondel et al.20, we set $\beta_1 = \ldots = \beta_m$, and a uniform rank $k = k_2 = \ldots = k_m$. In the experiments, we used a recent TensorFlow implementation of higher-order factorization machines44. On the NCI-ALMANAC data, increasing the order and rank of the factorization machine both improve the predictive performance (Pearson correlation) of the comboFM model (Supplementary Fig. 12). The predictive performance increases steeply until order 5, which matches the intrinsic order of the data tensor X (see Fig. 1b), and then continues to increase more slowly. The performance increase due to increasing rank of the factorization is rapid until around rank 50 and then continues to increase more slowly. There is no apparent overfitting even with factorization order as high as 10 and rank as high as 150.

Synergy quantification. As the interest often lies in discovering the most synergistic drug combinations, we quantify the drug combination synergies based on the predicted dose–response matrices. To compute the synergy scores, we apply the NCI ComboScore, which was introduced along with the NCI-ALMANAC dataset4, originally modified from the Bliss independence score. The NCI ComboScore for drug A and drug B is defined as the sum of the deviations between expected and observed responses over all concentrations $p$ and $q$:

$$y(A, B) = \sum_{p,q} \left( y_c(A_p, B_q) - y_e(A_p, B_q) \right) \qquad (6)$$

where $y_c(A_p, B_q)$ is the combination growth fraction of the cell line exposed to drug A in concentration $p$ and drug B in concentration $q$, and $y_e(A_p, B_q)$ is the expected growth fraction for the combination defined based on the monotherapy effects of drug A and drug B as follows:

$$y_e(A_p, B_q) = \begin{cases} \min\left( y_m(A_p), y_m(B_q) \right) & \text{if } y_m(A_p) \le 0 \text{ or } y_m(B_q) \le 0 \\ \frac{1}{150} \left( \tilde{y}_m(A_p) \, \tilde{y}_m(B_q) \right) & \text{otherwise} \end{cases} \qquad (7)$$

where $y_m(A_p)$ and $y_m(B_q)$ denote the monotherapy effects of drug A in concentration $p$ and drug B in concentration $q$, respectively. We applied $\tilde{y}_m = \min(y_m, 150)$ that truncates the growth fraction at 150, with the threshold selected based on the histogram of the measured drug combination responses (Supplementary Fig. 11).

Training setup. In order to evaluate the predictive performance and optimize the model parameters under the three prediction scenarios, we performed a 10 × 5 (10 outer folds, 5 inner folds) nested cross-validation procedure. For all the factorization machine models, the rank parameter was optimized in the range $k = \{25, 50, 75, 100\}$ and the regularization parameter in the range $\beta = \{10^2, 10^3, 10^4, 10^5\}$. The order of the modeled feature interaction was set to 5 according to the order of the underlying tensor, as a compromise between the training time and prediction accuracy. The learning rate was set to 0.001 based on preliminary experiments and other parameters were kept at their default values. The number of trees of the random forest model was optimized in the range {32, 64, 128, 512} and the fraction of features considered when looking for the best split (MaxFeatures) in the range {0.25, 0.5, 0.75, 1.0}. As each input sample is represented by a single feature vector, in order to take the symmetry of the drug combinations into account, the samples were duplicated such that both of the drugs in a combination were included in both positions in the feature vectors. This informs the algorithm that the combination of drug A with drug B should be considered the same as the combination of drug B with drug A. The prediction accuracy of all the models was assessed using the same performance evaluation metrics: RMSE, Pearson correlation, and Spearman correlation.

Evaluation of the prediction performance.
In this type of application, the predictive performance is significantly affected by whether the training and test sets share the different components of the modeled interactions, and it is thus important to reliably quantify the prediction accuracy under practical application scenarios. Therefore, we evaluated the predictive performance of comboFM under three prediction scenarios: (a) new dose–response matrix entry prediction, (b) new dose–response matrix prediction and (c) new drug combination prediction (cf. Fig. 1). For each scenario, we used dedicated nested cross-validation setups to ensure unbiased evaluation.

In scenario (a), the predictions were made for individual held-out entries in dose–response matrices. The held-out entries were selected at random for each cross-validation fold. In scenario (b), the predictions were made for completely held-out (dose–response matrix, cell line) pairs, such that the same drug combination had still been measured in other cell lines. This scenario corresponds to a widely used strategy in other computational works on drug combination synergy prediction, in which the predictions are made for new drug–drug–cell line triplets. In scenario (c), the most challenging scenario of new drug combination prediction, the predictions were made for novel drug combinations outside the training space with no available combination measurements. In all prediction scenarios, we assumed that the monotherapy responses of the single drugs in the combination are known.

To computationally evaluate the prediction performance and optimize the model parameters, we performed a nested cross-validation procedure. In the first prediction scenario of new dose–response matrix entry prediction, the cross-validation folds were formed by simple random sampling from the tensor entries.
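Fold construction for scenarios (a) to (c) amounts to random sampling at different grouping levels: single tensor entries, whole dose–response matrices (drug pair and cell line), or drug pairs. The following is a minimal sketch of that idea, assuming each tensor entry is stored as a tuple (drug1, drug2, conc1, conc2, cell_line). The function name make_folds and the scenario labels are hypothetical, and the sketch omits the nested inner loop and the additional check that each held-out drug still appears in other training combinations.

```python
import random


def make_folds(entries, scenario, n_folds=5, seed=0):
    """Assign every tensor entry to a cross-validation fold.

    entries  : list of (drug1, drug2, conc1, conc2, cell_line) tuples
    scenario : "entry"  -> sample individual entries           (scenario a)
               "matrix" -> sample (drug pair, cell line) units  (scenario b)
               "pair"   -> sample drug pairs                    (scenario c)
    Returns a list of fold indices aligned with `entries`.
    """
    def group_key(entry):
        d1, d2, _, _, cell = entry
        pair = tuple(sorted((d1, d2)))  # (A, B) and (B, A) are the same pair
        if scenario == "entry":
            return entry
        if scenario == "matrix":
            return (pair, cell)
        if scenario == "pair":
            return pair
        raise ValueError("unknown scenario: %r" % (scenario,))

    # Shuffle the distinct groups, then deal them out to folds in turn so
    # that all entries of one group always land in the same fold.
    groups = sorted({group_key(e) for e in entries})
    random.Random(seed).shuffle(groups)
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[group_key(e)] for e in entries]
```

Grouping before sampling is what keeps every entry of a held-out matrix or drug pair out of the training folds; sampling entries directly would leak parts of the same dose–response matrix into training.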
In the second prediction scenario concerning new dose–response matrices, the folds were created by randomly sampling on the level of dose–response matrices, i.e., if a drug pair–cell line triplet (x_d1, x_d2; x_c) belonged to the test set, the training tensor did not include any entry involving the triplet (x_d1, x_d2; x_c). In the third scenario of new drug combination prediction, the random sampling was performed on the level of drug pairs and all the entries involving the test drug pairs were held out from the training set, i.e., if a drug pair (x_d1, x_d2) belonged to the test set, the training tensor did not contain any entry involving the pair (x_d1, x_d2). Furthermore, we ensured that the individual drugs in the left-out drug pairs are still observed individually in other combinations in the training set, which enables the model to learn from the way the individual drugs in the held-out combinations act in other combinations.

Drug combination anticancer activity dataset. The drug combination anticancer activity dataset was obtained from a recent NCI-ALMANAC study4, which is the largest available drug combination dataset to date. The original dataset covers over 5000 combinations of roughly 100 small molecule drugs screened against 60 cell lines in various concentrations, containing over 3 million response measurements. The drugs included in the dataset are FDA-approved oncology drugs with proven activity and established safety profiles. The cell lines represent human tumor cell lines from the NCI-60 panel, originating from 9 different tissue types.
To reduce the computational complexity, we selected a subset of the NCI-ALMANAC dataset by randomly sampling 50 drugs (Supplementary Table 3) from the original set of drugs, ensuring that the distribution of the drug combination responses in the subset matched that of the original dataset. Furthermore, we selected drug combinations for which complete measurements across all the 60 cell lines were available. As a result, we obtained a dataset for our experiments consisting of 617 drug combinations of 50 unique drugs, screened in 45 unique concentrations against 60 cell lines, containing 333,180 response measurements for combinations and 222,120 measurements for monotherapies, measured by percentage growth of the cell line with respect to a control. Each drug combination in the dataset had been screened using a 4 × 4 dose–response matrix design.

incubation, 25 μl per well of CellTiter-Glo (Promega) reagent was added, and after 10 min of incubation at room temperature, luminescence (cell viability) was measured using a PheraStar plate reader (BMG Labtech).

Data representation. Defining an informative input feature representation of the underlying data is essential to take full advantage of comboFM and FMs in general. By defining appropriate input features, FMs have been shown to have the representation power encompassing a variety of matrix and tensor factorization models, from standard models to more specialized ones21,22. Hence, by learning FMs, all the subsumed factorization models can also be learned. In order to represent the structure of the tensor underlying the drug combination response data as single input feature vectors, one-hot encoding is used. Here, the input feature vectors x are divided into five different groups corresponding to the different modes of the tensor: two sets of drugs, their concentrations, and a cell line.
In each group, exactly one value is set to 1 and the rest to 0, with 1 denoting the instance that is present in the corresponding interaction:

$$x = \Bigl(\underbrace{0,\dots,0,1,0,\dots,0}_{|\mathrm{Drugs}|},\ \underbrace{0,\dots,0,1,0,\dots,0}_{|\mathrm{Drugs}|},\ \underbrace{0,\dots,0,1,0,\dots,0}_{|\mathrm{Concentrations}|},\ \underbrace{0,\dots,0,1,0,\dots,0}_{|\mathrm{Concentrations}|},\ \underbrace{0,\dots,0,1,0,\dots,0}_{|\mathrm{Cell\ lines}|}\Bigr) \qquad (8)$$

Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The NCI-ALMANAC dataset is publicly available from the National Cancer Institute (NCI) at https://wiki.nci.nih.gov/display/NCIDTPdata/NCI-ALMANAC. The preprocessed data used in the computational experiments and in-house drug combination testing data for validating comboFM predictions are available at https://doi.org/10.5281/zenodo.4135059. Source data underlying the figures and display items are provided at https://doi.org/10.5281/zenodo.4135059 in the subdirectory source_data.

Code availability
The code is available at https://doi.org/10.5281/zenodo.4129688.

Received: 16 April 2020; Accepted: 5 November 2020

As the feature vector is non-zero only for the pair of drugs, drug concentrations, and cell line present in the corresponding interaction, all the other interactions in the FM model vanish and the model corresponds to standard factorization models involving categorical variables.
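The five-group one-hot layout of Eq. 8 can be sketched as follows. This is an illustrative reading rather than the authors' preprocessing code: the function names and toy vocabularies are hypothetical. The second helper also illustrates the duplication of samples with the two drugs swapped, mentioned under Training setup. Swapping the two concentrations together with the drugs is an assumption of this sketch.

```python
def one_hot(value, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec


def encode_entry(drug1, drug2, conc1, conc2, cell_line,
                 drugs, concentrations, cell_lines):
    """Concatenate the five one-hot groups in the order of Eq. 8:
    drug 1, drug 2, concentration 1, concentration 2, cell line."""
    return (one_hot(drug1, drugs)
            + one_hot(drug2, drugs)
            + one_hot(conc1, concentrations)
            + one_hot(conc2, concentrations)
            + one_hot(cell_line, cell_lines))


def encode_symmetric(drug1, drug2, conc1, conc2, cell_line,
                     drugs, concentrations, cell_lines):
    """Encode an entry in both drug orders, so that (A, B) and (B, A)
    are presented to the model as the same combination."""
    return [
        encode_entry(drug1, drug2, conc1, conc2, cell_line,
                     drugs, concentrations, cell_lines),
        encode_entry(drug2, drug1, conc2, conc1, cell_line,
                     drugs, concentrations, cell_lines),
    ]
```

Each encoded vector has exactly five non-zero entries, which is why only the interactions among the drugs, concentrations, and cell line actually present contribute to the FM prediction.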
However, whereas standard factorization models are limited to categorical input data only, comboFM and FMs can also incorporate auxiliary features in addition to the information of the interacting elements, which can further aid the prediction task, particularly when making predictions outside the training space. In this work, we used chemical descriptors of molecules and genomic descriptors of cell lines (see below for details).

Chemical descriptors. As chemical descriptors, we integrated molecular fingerprints: binary vectors designed to represent the structure of a molecule as a series of bits, each one representing the presence or absence of a particular substructure. We selected a popular fingerprint of type ‘estate’, consisting of 79 bits corresponding to the E-State atom types originally defined by Hall and Kier45, obtained from the rcdk R package46. Fingerprint bits with zero variance across the dataset were removed, leaving 34 bits for each of the two sets of drugs.

Genomic descriptors. As genomic descriptors, we incorporated gene expression profiles of the cancer cell lines, obtained from the rcellminer R package47. The gene expression profiles were measured with five different platforms (four Affymetrix arrays and an Agilent Whole Human Genome Oligo array) and a combined average z-score was reported as the combined gene expression for a gene. To reduce the dimensionality of the resulting feature matrix, we selected the 0.5% of genes with the highest variance across the samples, resulting in 78 gene expression values for each cell line.

Cell lines. Early passage cell lines purchased from ATCC (HS-578T & Malme3M) and the NCI-Frederick DCTD tumor/cell lines repository (SR & IGR-OV1) were used for drug combination screening. The cell lines were maintained at 37 °C with 5% CO2 in a humidified incubator in their respective medium (see Supplementary Table 4a). All the reagents were purchased from ThermoFisher Scientific.
All the cell lines tested negative for mycoplasma. The test was based on the method described by Choppa et al.48 and was performed as a service by the sample management laboratory of THL Biobank, Helsinki, Finland.

Drug combination screening. The drug combination testing experimental design was adopted from Gautam et al.49. Seven different concentrations of the two drugs, in log3-fold dilution, were combined with each other in an 8 × 8 matrix format. Please refer to Supplementary Tables 4b and c for the drug information and combination design, respectively. The compounds were plated to black clear-bottom 384-well plates (Corning #3764) using an Echo 550 Liquid Handler (Labcyte). 100 μM benzethonium chloride (BzCl2) and 0.1% dimethyl sulfoxide (DMSO) were used as positive and negative controls, respectively. All subsequent liquid handling was performed using a MultiFlo FX multi-mode dispenser (BioTek). The pre-dispensed compounds were dissolved in 5 μl of culture media and left in a plate shaker at room temperature for 30 min. Twenty microliters of cell suspension (please refer to Supplementary Table 4a for cell line specific seeding densities) was dispensed in the drugged plates. After 72 h

References

1. Al-Lazikani, B., Banerji, U. & Workman, P. Combinatorial drug therapy for cancer in the post-genomic era. Nat. Biotechnol. 30, 679 (2012). 2. Masui, K. et al. A tale of two approaches: complementary mechanisms of cytotoxic and targeted therapy resistance may inform next-generation cancer treatments. Carcinogenesis 34, 725–738 (2013). 3. Lehár, J. et al. Synergistic drug combinations tend to improve therapeutically relevant selectivity. Nat. Biotechnol. 27, 659–666 (2009). 4. Holbeck, S. L. et al. The national cancer institute almanac: a comprehensive screening resource for the detection of anticancer drug pairs with enhanced therapeutic activity. Cancer Res. 77, 3564–3576 (2017). 5. O’Neil, J. et al.
An unbiased oncology compound screen to identify novel combination strategies. Mol. Cancer Therapeutics 15, 1155–1162 (2016). 6. Day, D. & Siu, L. L. Approaches to modernize the combination drug development paradigm. Genome Med. 8, 115 (2016). 7. Bulusu, K. C. et al. Modelling of compound combination effects and applications to efficacy and toxicity: state-of-the-art, challenges and perspectives. Drug Discov. Today 21, 225–238 (2016). 8. Ali, S., Tonekaboni, M., Ghoraie, L. S., Satya Kumar Manem, V. & Haibe-Kains, B. Predictive approaches for drug combination discovery in cancer. Brief. Bioinforma. 19, 263–276 (2018). 9. Rampášek, L., Hidru, D., Smirnov, P. & Goldenberg, A. Dr.VAE: improving drug response prediction via modeling of drug perturbation effects. Bioinformatics 35, 3743–3751 (2019). 10. Paltun, B. G., Mamitsuka, H. & Kaski, S. Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches. Brief Bioinform. https://doi.org/10.1093/bib/bbz153 (2019). 11. Cichonska, A. et al. Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics 34, i509–i518 (2018). 12. Cichonska, A. et al. Computational-experimental approach to drug-target interaction mapping: a case study on kinase inhibitors. PLoS Computational Biol. 13, e1005678 (2017). 13. Costello, J. C. et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat. Biotechnol. 32, 1202 (2014). 14. Gertrudes, J. C. et al. Machine learning techniques and drug design. Curr. Medicinal Chem. 19, 4289–4297 (2012). 15. Lavecchia, A. Machine-learning approaches in drug discovery: methods and applications. Drug Discov. Today 20, 318–331 (2015). 16. Griner, L. A. M. et al. High-throughput combinatorial screening identifies drugs that cooperate with ibrutinib to kill activated b-cell-like diffuse large b-cell lymphoma cells. Proc. Natl Acad. Sci. USA 111, 2349–2354 (2014). 17. Sidorov, P., Naulaerts, S., Ariey-Bonnet, J., Pasquier, E.
& Ballester, P. Predicting synergism of cancer drug combinations using nci-almanac data. Front. Chem. 7, 509 (2019). 18. Bansal, M. et al. A community computational challenge to predict the activity of pairs of compounds. Nat. Biotechnol. 32, 1213 (2014). 19. Menden, M. P. et al. Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen. Nat. Commun. 10, 2674 (2019). 20. Blondel, M., Fujino, A., Ueda, N. & Ishihata, M. Higher-order factorization machines. In Advances in Neural Information Processing Systems, 3351–3359 (2016). 21. Rendle, S. Factorization machines. In 2010 IEEE International Conference on Data Mining, 995–1000 (IEEE, 2010). 22. Rendle, S. Factorization machines with libfm. ACM Trans. Intell. Syst. Technol. (TIST) 3, 57 (2012). 23. Shoemaker, R. H. The nci60 human tumour cell line anticancer drug screen. Nat. Rev. Cancer 6, 813–823 (2006). 24. Li, H., Li, T., Quang, D. & Guan, Y. Network propagation predicts drug synergy in cancers. Cancer Res. 78, 5446–5457 (2018). 25. Jeon, M., Kim, S., Park, S., Lee, H. & Kang, J. In silico drug combination discovery for personalized cancer therapy. BMC Syst. Biol. 12, 16 (2018). 26. Gayvert, K. M. et al. A computational approach for identifying synergistic drug combinations. PLoS Comput. Biol. 13, e1005308 (2017). 27. Wildenhain, J. et al. Prediction of synergism from chemical-genetic interactions by machine learning. Cell Syst. 1, 383–395 (2015). 28. Chen, L. et al. Prediction of effective drug combinations by chemical interaction, protein interaction and target enrichment of kegg pathways. BioMed Res. Int. 2013, 723780 (2013). 29. Morris, S. W. et al. Fusion of a kinase gene, alk, to a nucleolar protein gene, npm, in non-hodgkin’s lymphoma. Science 263, 1281–1284 (1994). 30. Vlot, A.
H. C., Aniceto, N., Menden, M. P., Ulrich-Merzenich, G. & Bender, A. Applying synergy metrics to combination screening data: agreements, disagreements and pitfalls. Drug Discov. Today 24, 2286–2298 (2019). 31. Palmer, A. C. & Sorger, P. K. Combination cancer therapy can confer benefit via patient-to-patient variability without drug additivity or synergy. Cell 171, 1678–1691 (2017). 32. Boshuizen, J. & Peeper, D. S. Rational cancer treatment combinations: An urgent clinical need. Mol. Cell 78, 1002–1018 (2020). 33. Grazia, G., Penna, I., Perotti, V., Anichini, A. & Tassi, E. Towards combinatorial targeted therapy in melanoma: from pre-clinical evidence to clinical application. Int. J. Oncol. 45, 929–949 (2014). 34. Suraweera, A., O’Byrne, K. J. & Richard, D. J. Combination therapy with histone deacetylase inhibitors (hdaci) for the treatment of cancer: achieving the full therapeutic potential of hdaci. Front. Oncol. 8, 92 (2018). 35. Haas, N. B. et al. Phase ii trial of vorinostat in advanced melanoma. Invest. New Drugs 32, 526–534 (2014). 36. Ruzzolini, J. et al. Everolimus selectively targets vemurafenib resistant brafv600e melanoma cells adapted to low ph. Cancer Lett. 408, 43–54 (2017). 37. Pak, E. et al. A large-scale drug screen identifies selective inhibitors of class i hdacs as a potential therapeutic option for shh medulloblastoma. Neuro. Oncol. 21, 1150–1163 (2019). 38. Gerner, R. E., Moore, G. E. & Didolkar, M. S. Chemotherapy of disseminated malignant melanoma with dimethyl triazeno imidazole carboxamide and dactinomycin. Cancer 32, 756–760 (1973). 39. Rocca, A. et al. A phase i–ii study of the histone deacetylase inhibitor valproic acid plus chemoimmunotherapy in patients with advanced melanoma. Br. J. Cancer 100, 28–36 (2009). 40. Tyner, J. W. et al. Functional genomic landscape of acute myeloid leukaemia. Nature 562, 526–531 (2018). 41. Friedman, A. A., Letai, A., Fisher, D. E. & Flaherty, K. T. 
Precision medicine for cancer with next-generation functional diagnostics. Nat. Rev. Cancer 15, 747–756 (2015). 42. Ianevski, A. et al. Prediction of drug combination effects with a minimal set of experiments. Nat. Mach. Intell. 1, 568–577 (2019). 43. Haibe-Kains, B. et al. Inconsistency in large pharmacogenomic studies. Nature 504, 389–393 (2013). 44. Trofimov, M. & Novikov, A. TFFM: Tensorflow implementation of an arbitrary order factorization machine. https://github.com/geffy/tffm (2016). 45. Hall, L. H. & Kier, L. B. The molecular connectivity chi indexes and kappa shape indexes in structure-property modeling. Rev. Comput. Chem. 367–422 (1991). 46. Guha, R. et al. Chemical informatics functionality in r. J. Stat. Softw. 18, 1–16 (2007). 47. Luna, A. et al. rcellminer: exploring molecular profiles and drug response of the nci-60 cell lines in r. Bioinformatics 32, 1272–1274 (2015). 48. Choppa, P. C., Vojdani, A., Tagle, C., Andrin, R. & Magtoto, L. Multiplex pcr for the detection of Mycoplasma fermentans, M. hominis and M. penetrans in cell cultures and blood samples of patients with chronic fatigue syndrome. Mol. Cell. Probes 12, 301–308 (1998). 49. Gautam, P. et al. Identification of selective cytotoxic and synthetic lethal drug responses in triple negative breast cancer cells. Mol. Cancer 15, 34 (2016).

Acknowledgements
This work was supported by the Academy of Finland [ICT2023 programme grants 313268 to J.R.; 313266 to T.P. and 313267 to T.A. and grants 292611, 310507, 326238 to T.A.], the Cancer Society of Finland [T.A.], the Sigrid Jusélius Foundation [T.A.], and Orion Research Foundation sr [P.G.]. The authors thank the FIMM HTB unit and especially Laura Turunen for their great help with the drug combination assays and Aleksandr Ianevski for his great help with the synergy scoring and the background distribution data for Fig. 4.
The authors also acknowledge the computational resources provided by the Aalto Science-IT project as well as CSC - IT Center for Science, Finland.

Author contributions
H.J., T.A., A.C., S.S., T.P., and J.R. designed the research. H.J., A.C., T.P., J.R., and S.S. developed computational methods and evaluation protocols. A.C., J.D. and H.J. performed computational evaluations. P.G. designed and performed experimental evaluation. H.J., A.C., P.G., J.R., and T.A. wrote the paper, contributed by T.P. and S.S.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41467-020-19950-z.

Correspondence and requests for materials should be addressed to T.A. or J.R.

Peer review information Nature Communications thanks Krishna Bulusu and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2020

Publication II

Heli Julkunen, Anna Cichońska, P. Eline Slagboom, Peter Würtz, Nightingale Health UK Biobank Initiative. Metabolic biomarker profiling for identification of susceptibility to severe pneumonia and COVID-19 in the general population. eLife, May 2021.

© 2021 The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use and redistribution provided that the original author and source are credited.

RESEARCH ARTICLE

Metabolic biomarker profiling for identification of susceptibility to severe pneumonia and COVID-19 in the general population

Heli Julkunen1, Anna Cichońska1, P Eline Slagboom2,3, Peter Würtz1*, Nightingale Health UK Biobank Initiative1

1 Nightingale Health Plc, Helsinki, Finland; 2 Molecular Epidemiology, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands; 3 Max Planck Institute for Biology of Ageing, Cologne, Germany

*For correspondence: peter.wurtz@nightingalehealth.com

Competing interest: See page 17

Abstract Biomarkers of low-grade inflammation have been associated with susceptibility to a severe infectious disease course, even when measured prior to disease onset. We investigated whether metabolic biomarkers measured by nuclear magnetic resonance (NMR) spectroscopy could be associated with susceptibility to severe pneumonia (2507 hospitalised or fatal cases) and severe COVID-19 (652 hospitalised cases) in 105,146 generally healthy individuals from UK Biobank, with blood samples collected 2007–2010. The overall signature of metabolic biomarker associations was similar for the risk of severe pneumonia and severe COVID-19.
A multi-biomarker score, comprised of 25 proteins, fatty acids, amino acids, and lipids, was associated equally strongly with enhanced susceptibility to severe COVID-19 (odds ratio 2.9 [95%CI 2.1–3.8] for highest vs lowest quintile) and severe pneumonia events occurring 7–11 years after blood sampling (2.6 [1.7–3.9]). However, the risk for severe pneumonia occurring during the first 2 years after blood sampling for people with elevated levels of the multi-biomarker score was over four times higher than for long-term risk (8.0 [4.1–15.6]). If these hypothesis generating findings on increased susceptibility to severe pneumonia during the first few years after blood sampling extend to severe COVID-19, metabolic biomarker profiling could potentially complement existing tools for identifying individuals at high risk. These results provide novel molecular understanding on how metabolic biomarkers reflect the susceptibility to severe COVID-19 and other infections in the general population. Funding: See page 17 Received: 11 September 2020 Accepted: 02 May 2021 Published: 04 May 2021 Reviewing editor: Edward D Janus, University of Melbourne, Australia Copyright Julkunen et al. This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited. Introduction The coronavirus disease 2019 (COVID-19) pandemic affects societies and healthcare systems worldwide. Protection of those individuals who are most susceptible to a severe and potentially fatal COVID-19 disease course is a prime component of national policies, with stricter social distancing and other preventative means recommended mainly for elderly people and individuals with preexisting disease conditions. 
The prominent susceptibility to severe COVID-19 for people at high age has been linked with impaired immune response due to chronic inflammation caused by ageing processes (Akbar and Gilroy, 2020). However, large numbers of seemingly healthy middle-aged individuals also suffer from severe COVID-19 (Zhou et al., 2020; Atkins et al., 2020; Williamson et al., 2020); this could partly be due to similar molecular processes related to impaired immunity. A better understanding of the molecular factors predisposing to severe COVID-19 outcomes may help to explain the risk elevation ascribed to pre-existing disease conditions. From a translational point of view, this might also complement the identification of highly susceptible individuals in general population settings beyond current risk factor assessment. Julkunen et al. eLife 2021;10:e63033. DOI: https://doi.org/10.7554/eLife.63033 1 of 20 Research article Epidemiology and Global Health Medicine eLife digest National policies for mitigating the COVID-19 pandemic include stricter measures for people considered to be at high risk of severe and potentially fatal cases of the disease. Although older age and pre-existing health conditions are strong risk factors, it is poorly understood why susceptibility varies so widely in the population. People with cardiometabolic diseases, such as diabetes and liver diseases, or chronic inflammation are at higher risk of severe COVID-19 and other infections including pneumonia. These conditions alter the molecules circulating in the blood, providing potential ‘biomarkers’ to determine whether a person is more likely to develop a fatal infection. Uncovering these blood biomarkers could help to identify people who are prone to life-threatening infections despite not having ever been diagnosed with a cardiometabolic disease. To find these biomarkers, Julkunen et al. studied blood samples that had been collected from 105,000 healthy individuals in the United Kingdom over ten years ago. 
The data showed that individuals with biomarkers linked to low-grade inflammation and cardiometabolic disease were more likely to have died or been hospitalised with pneumonia. A score based on 25 of these biomarkers provided the best predictor of severe pneumonia. This biomarker score performed up to four times better within the first few years after blood sampling compared to predicting cases of pneumonia a decade later. The same blood biomarker changes were also linked with developing severe COVID-19 over ten years after the blood samples had been collected. The predictive value of the biomarker score was similar for both severe COVID-19 and the long-term risk of severe pneumonia. Julkunen et al. propose that the metabolic biomarkers reflect inhibited immunity that impairs response to infections. The results from over 100,000 individuals suggest that these blood biomarkers may help to identify people at high risk of severe COVID-19 or other infectious diseases. Pneumonia is a life-threatening complication of COVID-19 and the most common diagnosis in severe COVID-19 patients. As for COVID-19, the main factors that increase the susceptibility for severe community-acquired pneumonia are high age and pre-existing respiratory and cardiometabolic diseases, which can weaken the lungs and the immune system (Almirall et al., 2017). Based on analyses of large blood sample collections of healthy individuals, biomarkers associated with the risk for severe COVID-19 are largely shared with the biomarkers associated with the risk for severe pneumonia, including elevated markers of impaired kidney function and inflammation and lower HDL cholesterol (Ho et al., 2020). This may indicate that these molecular markers may reflect an overall susceptibility to severe complications after contracting an infectious disease. 
Comprehensive profiling of metabolic biomarkers, also known as metabolomics, in prospective population studies has suggested that a range of blood biomarkers for cardiovascular disease and diabetes also reflect the susceptibility to severe infectious diseases (Ritchie et al., 2015; Deelen et al., 2019). Metabolic profiling could therefore potentially identify biomarkers that reflect the susceptibility to severe COVID-19 among initially healthy individuals. However, such studies require measurement of vast numbers of blood samples collected prior to the COVID-19 pandemic. Conveniently, a broad panel of metabolic biomarkers has recently been measured using nuclear magnetic resonance (NMR) spectroscopy in over 100,000 plasma samples from the UK Biobank.

Here, we examined whether NMR-based metabolic biomarkers from blood samples collected a decade before the COVID-19 pandemic associate with the risk of severe infectious disease in UK general population settings. Exploiting the shared risk factor relation between susceptibility to severe COVID-19 and pneumonia (Ho et al., 2020), we used well-powered statistical analyses of biomarkers with severe pneumonia events to develop a multi-biomarker score that condenses the information from the metabolic measures into a single value. Taking advantage of the time-resolved information on the occurrence of severe pneumonia events in the UK Biobank, we mimicked the influence of the decade lag from blood sampling to the COVID-19 pandemic on the biomarker associations, and used analyses with short-term follow-up to interpolate to a scenario of identifying individuals susceptible to severe COVID-19 in a preventative screening setting. Our primary aim was to improve the molecular understanding of how metabolic risk markers may contribute to increased predisposition to severe COVID-19 and other infections.

Results
A flow diagram of eligible study participants and case numbers is shown in Figure 1. Clinical characteristics of the study population are listed in Table 1. Among the 105,146 UK Biobank study participants with complete data on metabolic biomarkers and severe pneumonia outcomes, and no prior history of diagnosed pneumonia, there were 2507 severe pneumonia events recorded in hospital or death registries after the baseline blood sampling (median follow-up time 8.1 years). For the severe COVID-19 analyses, there were 652 PCR-confirmed positive cases diagnosed in hospital (inferred as severe cases in this study) among the 92,725 individuals with COVID-19 data linkage available as of 3 February 2021. The number of severe COVID-19 cases in the UK Biobank closely followed the trends in individuals hospitalised for COVID-19 in England (Figure 1—figure supplement 1). In February 2021, the age range of study participants was 49–84 years. The median duration from blood sampling to the COVID-19 pandemic was 11.2 years (interquartile range 10.0–12.6).
The prevalence of chronic respiratory and cardiometabolic diseases was similar for study participants who developed severe pneumonia and those who contracted COVID-19 and required hospitalisation, with the exception of COPD. There were 33 overlapping cases between severe pneumonia and COVID-19.

Figure 1. Flow diagram of study participants and case numbers. Overview of eligible study participants for the analysis of metabolic biomarkers for the susceptibility to severe pneumonia and COVID-19 in the UK Biobank. Case and control definitions are described in Materials and methods.
Steps shown in the diagram:
UK Biobank, full cohort: n = 502 639
Exclude n = 384 190 without metabolic biomarker data
Participants with baseline metabolic biomarker data (random subset of the full cohort): n = 118 462
Exclude biomarker outliers and samples with missing values in 37 clinically validated biomarkers (n = 10 431)
Individuals with complete biomarker data across clinically validated biomarkers available: n = 108 031
Associations with severe pneumonia: exclude individuals with prevalent pneumonia or pneumonia recorded in primary care settings or by self reports (n = 2 889); participants with pneumonia outcome data n = 105 142; 2507 severe incident cases (hospitalization or death)
Associations with severe COVID-19: exclude individuals without COVID-19 testing data (assessment centres in Scotland and Wales) and individuals who had died before the COVID-19 pandemic (n = 15 306); participants with COVID-19 data available n = 92 725; 653 severe incident cases (hospitalization)
The online version of this article includes the following figure supplement(s) for figure 1: Figure supplement 1. Numbers of COVID-19 positive and hospitalised individuals in the UK Biobank and the whole of England during the course of the COVID-19 pandemic.

Table 1. Clinical characteristics of the UK Biobank participants in the current study.

                                             Severe pneumonia                   Severe COVID-19
                                             (diagnosis in hospital             (diagnosis in hospital)
                                             or death record)
                                             Incident cases    Controls         Incident cases    Controls
Individuals with NMR biomarker measures      2507              102 639          652               92 073
Age at blood sampling (median [range])       62 [40-70]        58 [39-70]       60 [40-70]        58 [39-70]
Females (%)                                  44%               54%              43%               54%
Body mass index (mean, kg/m2)                28.5              27.4             28.7              27.3
Proportion with prevalent diseases
  Cardiovascular disease (%)                 17.5%             6.6%             14.7%             6.4%
  Diabetes (%)                               9.3%              3.9%             9.2%              3.8%
  Lung cancer (%)                            0.4%              0.1%             0.3%              0.1%
  Chronic obstructive pulmonary disease (%)  6.1%              0.7%             1.8%              0.8%
  Liver diseases (%)                         1.5%              0.7%             1.7%              0.7%
  Renal failure (%)                          3.6%              1.3%             2.9%              1.4%
  Dementia (%)                               0.1%              0.01%            0.0%              0.01%

The number of individuals analysed for severe COVID-19 is slightly lower than for severe pneumonia, since COVID-19 data were not available from assessment centres in Scotland and Wales.

Metabolic biomarkers and severe pneumonia risk
Figure 2A shows the associations of 37 biomarkers with severe pneumonia events occurring during the follow-up in the entire study population (n = 105 146). The biomarkers highlighted here are those with a regulatory approval for diagnostics use in the Nightingale Health NMR platform. These biomarkers span most of the different metabolic pathways captured with the NMR platform; results for all 249 metabolic measures quantified are shown in Figure 2—figure supplements 1–3. Strong associations were observed across several metabolic pathways: increased plasma concentrations of cholesterol measures, omega-3 and omega-6 fatty acid levels, histidine, branched-chain amino acids and albumin were associated with lower susceptibility to contracting severe pneumonia.
Increased concentrations of monounsaturated and saturated fatty acids, as well as glycoprotein acetyls (GlycA, a marker of low-grade inflammation), were associated with elevated susceptibility to contracting severe pneumonia.

Since all the biomarkers are quantified in the same single measurement, we examined whether even stronger associations with severe pneumonia could be obtained using a combination of multiple biomarkers. We derived this multi-biomarker combination, denoted ‘infectious disease score’, using logistic regression with LASSO for variable selection, considering the 37 clinically validated biomarkers in one half of the study population as the training set. This resulted in an infectious disease score comprising the weighted sum of 25 biomarkers, with the weights selected by the machine learning algorithm (Supplementary file 1). Broadly similar results were obtained using all 249 metabolic measures quantified in the Nightingale Health NMR platform to derive the multi-biomarker score. The multi-biomarker infectious disease score was then tested for association with severe pneumonia in the other half of the study population. The magnitude of association for the infectious disease score was approximately twice as strong with severe pneumonia compared to any of the individual biomarkers (Figure 2B). The odds of contracting severe pneumonia increased by 67% per 1-SD increment in the infectious disease score. This corresponds to close to fourfold higher risk of contracting severe pneumonia among people in the highest quintile of the infectious disease score, compared to those with a score in the lowest quintile.

To assess the robustness of the multi-biomarker score association with severe pneumonia, we adjusted the analyses for prevalent diseases and performed analyses stratified by age and sex (Figure 3).

Figure 2. Relation of baseline biomarker concentrations to future risk of severe pneumonia in the UK Biobank (n = 105 146; 2507 incident events). (A) Odds ratios with severe pneumonia (2507 hospitalisations or deaths during a median of 8 years of follow-up) for 37 clinically validated biomarkers measured simultaneously in a single assay by the Nightingale Health NMR platform. (B) Odds ratio with severe pneumonia for the multi-biomarker infectious disease score. The infectious disease score comprises the weighted sum of 25 out of 37 clinically validated biomarkers, optimised for association with severe pneumonia based on one half of the study population using LASSO regression. Biomarkers included in the infectious disease score are marked by #. The odds ratio for the infectious disease score is evaluated in the other half of the study population (n = 52 573; 1250 events). All models are adjusted for age, sex, and assessment centre. Odds ratios are per 1-SD increment in the biomarker levels. Horizontal bars denote 95% confidence intervals. Closed circles denote p-value < 0.001 and open circles p-value ≥ 0.001. BCAA indicates branched-chain amino acids; DHA: docosahexaenoic acid; MUFA: monounsaturated fatty acids; PUFA: polyunsaturated fatty acids; SFA: saturated fatty acids.
[Forest plot. Panel A, biomarkers by group. Lipoprotein lipids: Total-C #, VLDL-C, LDL-C #, HDL-C, Triglycerides #. Apolipoproteins: ApoB, ApoA1, ApoB/ApoA1 #. Fatty acids: Total fatty acids, Omega-3 #, Omega-6, PUFA #, MUFA #, SFA, DHA. Fatty acid ratios: Omega-3 %, Omega-6 % #, PUFA % #, MUFA %, SFA % #, DHA % #, PUFA/MUFA, Omega-6/Omega-3 #. Amino acids: Alanine #, Glycine #, Histidine #, Isoleucine #, Leucine #, Valine #, Phenylalanine #, Tyrosine #, Total BCAA. Glycolysis metabolites: Glucose #, Lactate #. Fluid balance: Creatinine #, Albumin #. Inflammation: Glycoprotein acetyls #. x-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in biomarker level. Panel B, infectious disease score. x-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in infectious disease score.]
The online version of this article includes the following source data and figure supplement(s) for figure 2: Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 2. Figure supplements 1–3. Relation of all biomarkers measured by the Nightingale Health NMR platform to risk of severe pneumonia in UK Biobank (n = 105 146; 2507 events).
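The score derivation described above (L1-penalised logistic regression fitted on one half of the cohort, with the resulting weighted sum evaluated in the other half) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are simulated, there are three hypothetical biomarkers rather than 37, the penalty strength `lam` is arbitrary, and the proximal-gradient solver `fit_l1_logistic` is a hypothetical stand-in, not the software used in the study.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero; anything within +/- threshold becomes 0."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def fit_l1_logistic(X, y, lam, step=0.5, iters=300):
    """L1-penalised logistic regression via proximal gradient descent.

    Hypothetical helper for illustration; the intercept is not penalised.
    """
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            z = intercept + sum(w * x for w, x in zip(weights, row))
            residual = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += residual
            for j in range(p):
                grad_w[j] += residual * row[j]
        intercept -= step * grad_b / n
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        weights = [soft_threshold(w - step * g / n, step * lam)
                   for w, g in zip(weights, grad_w)]
    return weights, intercept


def score(row, weights):
    """Multi-biomarker score: weighted sum of standardised biomarker levels."""
    return sum(w * x for w, x in zip(weights, row))


# Simulated cohort (assumption): only biomarker 0 truly affects the outcome.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(600)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(2.0 * row[0] - 1.0))) else 0
     for row in X]

# Derive the weights in one half; keep the other half for evaluation.
X_train, y_train = X[:300], y[:300]
X_valid, y_valid = X[300:], y[300:]
weights, _ = fit_l1_logistic(X_train, y_train, lam=0.05)

# Evaluate in the held-out half: mean score among cases vs. controls.
case_scores = [score(r, weights) for r, lab in zip(X_valid, y_valid) if lab == 1]
control_scores = [score(r, weights) for r, lab in zip(X_valid, y_valid) if lab == 0]
mean_case = sum(case_scores) / len(case_scores)
mean_control = sum(control_scores) / len(control_scores)
print(weights, mean_case, mean_control)
```

The soft-thresholding step is what sets weak weights to exactly zero, which is how a LASSO fit can retain only a subset of the candidate biomarkers. In practice one would rely on an established library implementation and choose the penalty by cross-validation; the hand-written solver here only keeps the sketch self-contained.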
The association was attenuated by ~10% in magnitude when adjusting for prevalent diseases at the time of blood sampling, or when omitting individuals with such a diagnosis (cardiovascular diseases, diabetes, lung cancer, COPD, liver diseases, renal failure, and dementia; panels 3A and 3B). The association was similar across age groups, and also for men and women analysed separately (panels 3C and 3D). To mimic the influence of the decade-long lag from blood sample collection to the COVID-19 pandemic, we tested the association of the multi-biomarker infectious disease score with severe pneumonia events occurring 7–11 years after the blood sampling (Figure 4A). Since only a few severe pneumonia events were recorded with more than 9 years of follow-up, we could not fully mimic the decade-long time lag to the COVID-19 pandemic. The risk elevation observed in this time-lag scenario was only approximately half of that observed for severe pneumonia events occurring within the first 7 years (odds ratio 1.43 vs 1.75 per 1-SD, respectively; and 2.59 vs 4.27 for individuals in the highest vs lowest quintile of the infectious disease score). To approximate a screening scenario conducted today, we also tested the association with short-term risk of severe pneumonia by analysing events occurring within the first 2 years after the blood sampling (Figure 4B). The association magnitude in this short-term risk scenario was approximately twice as strong as for severe pneumonia events occurring more than 2 years after blood sampling (odds ratio 2.21 vs 1.59 per 1-SD; 7.95 vs 3.35 for individuals in the highest vs lowest quintile of the infectious disease score). The elevated susceptibility to severe pneumonia associated with the multi-biomarker score was therefore three to four times stronger when examining short-term risk as compared to risk of severe pneumonia events occurring almost a decade after the blood sampling.
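The paired odds ratios quoted above (per 1-SD increment, and highest vs lowest quintile) can be related by a back-of-envelope calculation. The sketch below is an approximation and not the method used in the study: it assumes the score is normally distributed and that the log-odds are linear in the score, so the quintile contrast is taken to be the per-SD odds ratio raised to the distance, in SD units, between the mean scores of the top and bottom quintiles. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def quintile_or_from_per_sd(or_per_sd):
    """Approximate highest-vs-lowest quintile OR from a per-1-SD OR.

    Assumes a standard normal score and log-odds linear in the score.
    """
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(0.8)          # 80th percentile, about 0.84 SD
    top_mean = std_normal.pdf(cutoff) / 0.2   # mean of the top quintile, about 1.40 SD
    gap = 2 * top_mean                        # bottom quintile mirrors the top one
    return math.exp(gap * math.log(or_per_sd))


# Pairs of (per-SD OR, reported quintile OR) taken from the text above.
for per_sd, reported in [(1.43, 2.59), (1.75, 4.27), (1.59, 3.35), (2.21, 7.95)]:
    approx = quintile_or_from_per_sd(per_sd)
    print(f"per-SD OR {per_sd}: approx. quintile OR {approx:.2f} (reported {reported})")
```

For these four pairs the approximation comes out somewhat higher than the reported quintile odds ratios, so it should be read only as a rough consistency check between the two ways of expressing the same association; the reported values come from fitted models, not from this idealised conversion.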
The elevation in the short-term risk for severe pneumonia for high levels of the infectious disease multi-biomarker score remained strong when adjusting for BMI, smoking and prevalent diseases (odds ratio 6.10 for individuals in the highest vs lowest quintile; Figure 4—figure supplement 1).

We further explored the risk gradient for a future onset of severe pneumonia along increasing levels of the infectious disease score, since non-linear effects could potentially facilitate the identification of thresholds for individuals at high susceptibility. Figure 5A shows the increase in the proportion of individuals who contracted severe pneumonia according to percentiles of the score. The risk increased prominently in the highest quintile, and particularly for the highest few percentiles. The time-resolved plot of the cumulative probability of severe pneumonia during follow-up is shown in Figure 5B. The susceptibility to severe pneumonia was particularly elevated among individuals with the very highest levels of the multi-biomarker infectious disease score. This was observed already during the first few years of follow-up, corroborating the results for long-term and short-term risk shown in Figure 4. The prominent and immediate elevation in susceptibility to severe pneumonia was also observed when limiting analyses to individuals without chronic respiratory and cardiometabolic diseases at the time of blood sampling (Figure 5—figure supplement 1).

Figure 3. Relation of the multi-biomarker infectious disease score to future risk of severe pneumonia with additional adjustments and in subgroups (n = 52 573; 1250 incident events). (A) Odds ratios with severe pneumonia after additional adjustments for BMI, smoking status, and prevalent diseases. (B) Odds ratios with severe pneumonia in study participants with and without prevalent diseases. (C) Odds ratios by age tertiles at the time of blood sampling. (D) Odds ratios for men and women separately. All models are adjusted for age, sex, and assessment centre. The left-hand side shows odds ratios per 1-SD increment in the multi-biomarker infectious disease score, and the right-hand side odds ratios for comparing individuals in the highest and lowest quintiles of the score. The results are based on the validation half of the study population not used in deriving the infectious disease score (1250 events during a median of 8 years of follow-up).
[Forest plots. Panel A, additional adjustments: age, sex and assessment center; plus BMI and smoking status; plus prevalent diseases. Panel B, individuals with and without prevalent diseases: all individuals; without prevalent diseases. Panel C, age at blood sampling: 39–53, 53–61, 61–70. Panel D, men and women separately. Left x-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in infectious disease score. Right x-axis: odds ratio for severe pneumonia (95% CI), highest vs. lowest quintile of infectious disease score.]
The online version of this article includes the following source data for figure 3: Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 3.

Metabolic biomarkers and severe COVID-19

Figure 6 shows the associations of the 37 clinically validated biomarkers and the infectious disease score with the future onset of severe COVID-19 (defined as PCR-confirmed positive inpatient diagnosis). Many of the individual biomarkers had significant associations (p-value < 0.001) with increased risk for severe COVID-19. These biomarkers for susceptibility to severe COVID-19 include lower levels of omega-3 and omega-6 fatty acids as well as albumin, and higher levels of GlycA. We observed a high concordance between the overall pattern of COVID-19 biomarker associations and that for severe pneumonia (Figure 2A), with a Spearman correlation of 0.89 between the overall biomarker association signatures for severe pneumonia and severe COVID-19 (Figure 7).
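The concordance statistic quoted above is a Spearman correlation between two vectors of biomarker associations, one per outcome. As a minimal sketch of that computation, the following ranks two lists of odds ratios and correlates the ranks. The odds ratios shown are made-up placeholders, not values from the study, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder odds ratios for five hypothetical biomarkers (not study data).
or_pneumonia = [0.80, 0.85, 0.95, 1.10, 1.30]
or_covid = [0.85, 0.82, 0.97, 1.05, 1.25]
rho = spearman(or_pneumonia, or_covid)
print(round(rho, 3))
```

Because the statistic depends only on ranks, it gives the same result whether it is computed on odds ratios or on their logarithms.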
Figure 4. Relation of the multi-biomarker infectious disease score to long-term and short-term future risk for severe pneumonia (n = 52 573; 1250 incident events). (A) Odds ratios with severe pneumonia events occurring within the first 7 years after the blood sampling, compared to events that occurred 7–11 years after blood sampling. (B) Odds ratios for severe pneumonia occurring within and after the first 2 years of blood sampling. Models are adjusted for age, sex, and assessment centre. The left-hand side shows odds ratios per 1-SD increment in the multi-biomarker infectious disease score, and the right-hand side odds ratios for comparing individuals in the highest and lowest quintiles of the score. The results are based on the validation half of the study population that was not used in deriving the infectious disease score.
[Forest plots. Panel A, "Mimicking the decade lag to the COVID-19 pandemic": within 7 years from blood sampling (943 events); 7–11 years from blood sampling (307 events). Panel B, "Mimicking a preventative screening scenario carried out today": within 2 years from blood sampling (162 events); 2–11 years from blood sampling (1088 events). Left x-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in infectious disease score. Right x-axis: odds ratio for severe pneumonia (95% CI), highest vs. lowest quintile of infectious disease score.]
The online version of this article includes the following source data and figure supplement(s) for figure 4: Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 4. Figure supplement 1. Relation of the multi-biomarker infectious disease score to long-term and short-term risk for severe pneumonia after adjustment for risk factors and prevalent diseases (n = 52 573; 1250 events).

Figure 5. Risk gradient for contracting severe pneumonia after the blood sampling according to percentiles of the multi-biomarker infectious disease score (n = 52 573; 1250 incident events). (A) Proportion of individuals who contracted severe pneumonia during a median follow-up time of 8.1 years after the blood sampling according to percentiles of the multi-biomarker infectious disease score. Each point represents approximately 500 individuals. (B) Kaplan-Meier curves of the cumulative probability for severe pneumonia in quantiles of the multi-biomarker infectious disease score. The follow-up time was truncated at 9.5 years since only a small fraction of individuals were followed longer. Results are based on the validation half of the study population that was not used in deriving the infectious disease score (n = 52,573). The corresponding plots for individuals free of baseline respiratory and cardiometabolic diseases are shown in Figure 5—figure supplement 1.
[Panel A: percentage of severe pneumonia cases (y-axis, 0–10%) against infectious disease score percentile (x-axis, 0–100). Panel B: cumulative incidence (y-axis, 0–10%) against follow-up time in years (x-axis), with curves for infectious disease score percentile ranges 0–80%, 80–90%, 90–95%, 95–97.5% and 97.5–100%.]
The online version of this article includes the following source data and figure supplement(s) for figure 5: Source data 1. Numerical tabulation of event rates for each percentile in Figure 5A. Figure supplement 1. Risk gradients for contracting severe pneumonia by percentiles of the multi-biomarker infectious disease score among individuals without prevalent diseases at time of blood sampling (n = 46,252; 877 events).

The multi-biomarker infectious disease score derived for the future onset of severe pneumonia was also robustly associated with the future onset of severe COVID-19. The odds ratio was 1.40 per 1-SD increment and 2.90 for comparing individuals in the highest quintile of the multi-biomarker infectious disease score to those in the lowest quintile. This magnitude of association with susceptibility to severe COVID-19 was similar to that observed with severe pneumonia events occurring during the interval of 7–11 years after the blood sampling. We further examined the association of the multi-biomarker infectious disease score with severe COVID-19 after adjustment or exclusion for prevalent diseases, and conducted stratified analyses for age and sex (Figure 8). The association with severe COVID-19 was attenuated, but remained significant, when adjusted for BMI, smoking and prevalent diseases (panel 8A). The association magnitudes were approximately 20% weaker when limiting the COVID-19 analyses to individuals without prevalent diseases at time of blood sampling (panel 8B). There was no robust evidence of differences in association magnitude according to age (panel 8C), and odds ratios were broadly similar for men and women (panel 8D).

Finally, we examined the technical repeatability and biological stability of measuring the multi-biomarker infectious disease score. The measurement repeatability was high (Pearson correlation 0.94 in blind duplicate samples; Figure 9A).
Even though the blood samples were primarily non-fasting, the levels of the infectious disease score remained broadly stable during 4 years based on blood samples from repeat visits (Pearson correlation 0.61 between baseline and repeat visit measurements; Figure 9B).

Figure 6. Relation of baseline biomarkers and multi-biomarker infectious disease score to future risk of severe COVID-19 (n = 92 725; 652 cases diagnosed in hospital). (A) Odds ratios with severe COVID-19 (defined as PCR-positive diagnosis in hospital; 652 cases out of 92 725 individuals) for 37 clinically validated biomarkers measured by NMR. (B) Odds ratio with severe COVID-19 for the multi-biomarker infectious disease score. Biomarkers included in the infectious disease score are marked by #. Odds ratios are per 1-SD increment in the biomarker levels. Models are adjusted for age, sex, and assessment centre.
[Forest plot. Panel A, the 37 biomarkers grouped as lipoprotein lipids, apolipoproteins, fatty acids, fatty acid ratios, amino acids, glycolysis metabolites, fluid balance and inflammation. x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in biomarker level. Panel B, infectious disease score. x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in infectious disease score.]
The online version of this article includes the following source data for figure 6: Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 6.

Discussion

Most biomarker studies on COVID-19 have focused on characterising already infected patients and their disease prognosis (Kermali et al., 2020; Shen et al., 2020; Messner et al., 2020; Dierckx et al., 2020). In contrast, in the largest blood metabolic profiling study to date, we explored biomarker associations for susceptibility to severe pneumonia and COVID-19 in general population settings. We developed a multi-biomarker score for increased susceptibility to a severe infectious disease course, and demonstrated that this biomarker score captures an increased risk for COVID-19 hospitalisation a decade after the blood sampling.

The overall signature of biomarker associations was similar for the susceptibility to severe COVID-19 and to severe pneumonia (Figure 7). The proportions of individuals with existing cardiometabolic diseases were also consistent for both of these infectious diseases (Table 1). We used these observations of a shared risk factor basis to draw an analogy between susceptibility to severe pneumonia and severe COVID-19, and hereby infer potential implications for preventative screening. We therefore exploited the strong statistical power and time-resolved information on severe pneumonia events for more detailed analyses than was feasible with COVID-19. This led to three important observations.
First, the infectious disease multi-biomarker score was largely independent of prevalent chronic respiratory and cardiometabolic diseases (Figure 3). Second, the susceptibility to severe pneumonia was drastically elevated in the extreme tail of the multi-biomarker infectious disease score, with 5–10 times higher risk compared to individuals with normal levels of the multi-biomarker score (Figure 5). Such features might aid in establishing thresholds for identifying individuals most susceptible to a severe disease course. Third, the odds ratio of the multi-biomarker score for severe pneumonia events occurring after 7–11 years closely matched that of severe COVID-19, for which all events occurred over a decade after blood sampling (Figure 4A). Yet, screening for the susceptibility to severe COVID-19 would require a strong association with the short-term risk. When confining the analyses of severe pneumonia to events occurring within the first 2 years after blood sampling, the short-term risk elevation was over four times stronger than that observed for long-term risk; individuals with high levels of the multi-biomarker score were almost 7 times more susceptible than people with low levels (Figure 4B). If a similar enhancement in short-term risk extends to COVID-19, our results could potentially indicate applications for identification of individuals at high susceptibility to a severe COVID-19 disease course. However, the unavailability of metabolic biomarker data from blood samples drawn shortly prior to the pandemic prevents us from examining biomarker associations with short-term COVID-19 susceptibility, and our results should therefore be considered hypothesis-generating.
We observed multiple blood biomarkers commonly linked with the risk for cardiovascular disease and diabetes (Soininen et al., 2015; Würtz et al., 2017; Holmes et al., 2018; Ahola-Olli et al., 2019) to also be associated with increased susceptibility to both severe pneumonia and severe COVID-19. The biomarkers span multiple metabolic pathways, including low concentrations of lipoprotein lipids, impaired fatty acid balance, decreased amino acid levels and high chronic inflammation. This is the first study to show that many of these blood biomarkers associate with susceptibility to severe infections, potentially indicating that fatty acids and amino acids should not be considered only as biomarkers for cardiometabolic risk. The associations of omega-3 and other fatty acids with the risk for severe COVID-19 may be particularly important, as these measures are more directly modifiable by lifestyle means than common markers of inflammation.

The overall pattern of biomarker associations followed a characteristic metabolic signature reflective of an increased susceptibility to a severe infectious disease. This pattern of biomarker associations is broadly similar to what has previously been reported with the risk for all-cause mortality in smaller prospective cohort studies (Deelen et al., 2019). It is therefore unlikely that the identified biomarker signature is specific to the risk for severe pneumonia and COVID-19, or even specific to infectious diseases in general. We propose that the overall metabolic biomarker perturbations observed here reflect molecular signals of low-grade inflammation that exacerbate disease severity, in the case of both infectious and chronic diseases (Akbar and Gilroy, 2020; Bonafè et al., 2020). In line with this, prior studies have demonstrated that elevated levels of GlycA, the biomarker with the strongest weight in the infectious disease score, are associated with increased neutrophil activity and the long-term risk for fatal infections (Ritchie et al., 2015). Such over-activity of the immune response from pneumonia or COVID-19 infection is known to cause tissue damage and organ dysfunction through cytokine storm, a common complication of severe COVID-19 (Mangalmurti and Hunter, 2020). While the specific biological mechanisms underpinning the blood metabolic biomarker associations with chronic and infectious diseases remain poorly understood, we emphasize that the observational character of our study does not allow us to conclude whether the biomarkers are contributing causally to increase the risk or are merely indirect risk markers.

Figure 7. Concordance of the overall pattern of biomarker associations with future onset of severe pneumonia and severe COVID-19. Biomarker associations with future onset of severe pneumonia (y-axis) plotted against the corresponding associations with severe COVID-19 (x-axis). The odds ratios, with adjustment for age, sex, and assessment centre, for each of the 37 clinically validated biomarkers in the Nightingale Health NMR platform are given with 95% confidence intervals in vertical and horizontal error bars. The dashed line denotes the diagonal.
[Scatter plot, Spearman correlation: 0.89. y-axis: odds ratio for severe pneumonia (95% CI), per 1-SD increment in biomarker level. x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in biomarker level. Each point is labelled with its biomarker name.]
The online version of this article includes the following source data for figure 7: Source data 1. Numerical tabulation of odds ratios, and 95% confidence intervals for results shown in Figure 7.

Figure 8. Relation of the multi-biomarker infectious disease score to future risk of severe COVID-19 with additional adjustments and in subgroups of the study population (n = 92,725; 652 cases diagnosed in hospital). (A) Odds ratios with severe COVID-19 after additional adjustments for BMI, smoking status and prevalent diseases. (B) Odds ratios with severe COVID-19 in study participants with and without prevalent diseases at the time of blood sampling. (C) Odds ratios by age tertiles at the time of the COVID-19 pandemic. (D) Odds ratios for men and women, separately. The left-hand side shows the odds ratios per 1-SD increment in the multi-biomarker infectious disease score, and the right-hand side the odds ratios for comparing individuals in the highest and lowest quintiles of the score. All models are adjusted for age, sex, and assessment centre.
[Forest plots. Panel A, additional adjustments: age, sex and assessment center; plus BMI and smoking status; plus prevalent diseases. Panel B, individuals with and without prevalent diseases: all individuals; without prevalent diseases. Panel C, age at the time of COVID-19 pandemic: 49–63, 63–72, 72–84. Panel D, men and women separately. Left x-axis: odds ratio for severe COVID-19 (95% CI), per 1-SD increment in infectious disease score. Right x-axis: odds ratio for severe COVID-19 (95% CI), highest vs. lowest quintile of infectious disease score.]
The online version of this article includes the following source data for figure 8: Source data 1. Numerical tabulation of odds ratios, betas, standard errors, and p-values for results shown in Figure 8.

Replication of novel biomarker associations is a key aspect in observational studies.
We are not aware of other prospective studies with sufficient COVID-19 hospitalisation events and NMR-based metabolic biomarker data to address this. However, a preprint of the present study featured analysis of 195 severe COVID-19 cases, based on data available in UK Biobank in June 2020 (Julkunen et al., 2020). In the present updated analyses, with over three times the number of cases, all biomarker associations with susceptibility to severe COVID-19 were similar or stronger, and hereby provide a within-cohort replication of our initial findings. In addition, a recent study used the same metabolic biomarker panel in three cohorts of hospitalised patients and observed similar overall biomarker perturbations to be predictive of COVID-19 severity (Dierckx et al., 2020). The study also reported the multi-biomarker infectious disease score to be among the strongest biomarkers for discriminating COVID-19 severity among already hospitalised patients.
Figure 9. Technical repeatability for measuring the multi-biomarker infectious disease score and biological stability in repeat measures 4 years after the baseline blood sampling. (A) Technical repeatability of the infectious disease score assessed in blind duplicate samples. The correlation plot is based on 2863 blind duplicate plasma samples measured along with the regular measurements of ~105,000 samples in the Nightingale Health-UK Biobank initiative. (B) Biological stability of the infectious disease score based on plasma samples from 1298 individuals who attended both the baseline visit as well as a repeat visit ~4 years later at the UK Biobank assessment centres.
Our study has both strengths and limitations. 
Strengths include the large sample size, which enabled the analysis of biomarkers for susceptibility to severe COVID-19 based on pre-pandemic blood samples from general population settings. We used a validated metabolic profiling platform that enables simultaneous quantification of numerous metabolic biomarkers in a scalable low-cost setup. Although the number of hospitalised COVID-19 cases was in line with the prevalence in England, we acknowledge that the statistical power was limited for prediction analyses even with close to 100,000 samples linked with COVID-19 outcome data. Furthermore, the UK Biobank study participants are not fully representative of the UK population by demographic characteristics; the individuals were enrolled on a volunteer basis and are therefore more representative of healthier individuals than average (Sudlow et al., 2015; Fry et al., 2017). Even though this is generally not a concern for investigating risk associations (Keyes and Westreich, 2019), it does limit the statistical power to explore effects of ethnicity and old age. Other limitations include the decade long duration from blood sampling to the COVID-19 pandemic. While this limits inference on how well the biomarkers predict short-term risk for severe COVID-19, our analogy with long-term risk for severe pneumonia indicates that the time lag likely attenuates the biomarker association magnitudes substantially. Conversely, the remarkably strong associations for short-term risk of severe pneumonia led us to speculate that similar enhancements in association magnitudes could also hold for severe COVID-19. However, this inference should be further tested, in particular in the light of the bacterial origin of many severe pneumonia cases and the viral origin of COVID-19. 
Weaker biomarker associations for severe COVID-19 compared to severe pneumonia may also arise from the UK Biobank COVID-19 data being influenced by ascertainment bias in terms of differential healthcare seeking and differential testing (Griffith et al., 2020), whereas pneumonia is anticipated to have nearly complete case ascertainment (Ho et al., 2020).
In conclusion, a metabolic signature of perturbed blood biomarkers is associated with an increased susceptibility to both severe pneumonia and COVID-19 in blood samples collected a decade before the pandemic. The multi-biomarker score captures an elevated susceptibility to severe pneumonia within a few years after blood sampling that is several times stronger than the risk elevation associated with many pre-existing health conditions, such as obesity and diabetes (Ho et al., 2020). If the three- to fourfold elevation in short-term risk compared to long-term risk of severe pneumonia also applies to severe COVID-19, then metabolic biomarker profiling could potentially complement existing tools for identifying individuals most susceptible to a severe COVID-19 disease course. Regardless of the translational prospects, these results provide novel understanding of how metabolic biomarkers may reflect the susceptibility to severe COVID-19 and other infections.
Materials and methods
Study population
Details of the design of the UK Biobank have been reported previously (Sudlow et al., 2015). Briefly, the UK Biobank recruited 502,639 participants aged 37–70 years in 22 assessment centres across the UK. All study participants had to be able to attend the assessment centres by their own means, and there was no enrolment at nursing homes. All participants provided written informed consent and ethical approval was obtained from the North West Multi-Center Research Ethics Committee. Blood samples were drawn at baseline between 2007 and 2010. The current analysis was approved under UK Biobank Project 30418. 
No selection criteria were applied to the sampling.
Metabolic biomarker profiling
From the entire UK Biobank population, a random subset of non-fasting baseline plasma samples (aliquot 3) from 118,466 individuals and 1298 repeat-visit samples were measured using high-throughput NMR spectroscopy (Nightingale Health Plc; biomarker quantification version 2020). This provides simultaneous quantification of 249 metabolic biomarker measures in a single assay, including routine lipids, lipoprotein subclass profiling with lipid concentrations within 14 subclasses, fatty acid composition, and various low-molecular-weight metabolites such as amino acids, ketone bodies, and glycolysis metabolites quantified in molar concentration units. Technical details and epidemiological applications of the metabolic biomarker data have been reviewed (Soininen et al., 2015; Würtz et al., 2017). The Nightingale NMR platform has received various regulatory approvals, including CE-mark, and 37 biomarkers in the panel have been certified for diagnostics use. We focused on this particular set of certified biomarkers, as we wanted to investigate if these markers of systemic metabolism, which are commonly linked to cardiometabolic diseases, could also be associated with future risk for severe infectious disease. Furthermore, these clinically validated biomarkers span most of the different metabolic pathways measured by the NMR platform and could facilitate potential translational applications as they are certified for diagnostics use and are measured simultaneously in a single assay. The mean and standard deviation of concentrations for the 249 quantified metabolic biomarkers are given in Supplementary file 2. Measurements of the metabolic biomarkers were conducted blinded prior to the linkage to the UK Biobank health outcomes. 
The metabolic biomarker data were curated and linked to UK Biobank clinical data in late May 2020. The metabolic biomarker dataset was made available to the research community through the UK Biobank in March 2021.
Severe pneumonia outcomes
We combined ICD-10 codes J12–J18 to define the pneumonia endpoint. To strengthen the analogy with the analysis of severe COVID-19, we focused on severe pneumonia events, defined as diagnosis in hospital or death records based on UK Hospital Episode Statistics data and national death registries (2507 incident cases in the current study). All analyses are based on the first occurrence of a diagnosis. Therefore, 2658 individuals with a recorded hospitalisation for pneumonia prior to the blood sampling were excluded. Additionally, 346 individuals with a pneumonia diagnosis recorded in primary care settings or by self-report were omitted from the analyses. The registry-based follow-up was from blood sampling in 2007–2010 through to 2016–2017, depending on assessment centre (850,000 person-years).
Severe COVID-19 outcomes
We used COVID-19 data available in the UK Biobank as of 3 February 2021, which cover test results from 16 March to 1 February 2021. These data include information on positive/negative PCR-based diagnosis results and explicit evidence in the microbiological record on whether the participant was an inpatient (Resource UKBiobankD, 2020). For the present analyses, we focused on PCR-positive inpatient diagnoses. These hospitalised cases are here denoted as severe COVID-19 (652 cases in the current study). COVID-19 data were not available for assessment centres in Scotland and Wales, so individuals from these centres were excluded. Individuals who had died during follow-up prior to 2018 were also excluded, since they were never exposed to COVID-19. 
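The severe pneumonia endpoint described above (ICD-10 codes J12–J18, first occurrence in hospital or death records, exclusion of events before blood sampling) can be illustrated with a minimal sketch. The record layout, the function names and the labels "prevalent" and "incident" are hypothetical and chosen for illustration only; they are not taken from the study's code.

```python
from datetime import date

# Three-character ICD-10 categories J12..J18 define the pneumonia endpoint.
PNEUMONIA_CODES = {f"J{n}" for n in range(12, 19)}


def is_pneumonia(icd10_code: str) -> bool:
    """Return True if the code falls in J12-J18 (sub-codes such as J18.1 included)."""
    return icd10_code.strip().upper()[:3] in PNEUMONIA_CODES


def first_severe_pneumonia(records, baseline):
    """Classify one participant from hypothetical (date, icd10_code) records.

    Returns ("prevalent", date) if the first pneumonia record precedes blood
    sampling (such participants were excluded in the study), ("incident", date)
    if it occurs on or after blood sampling, and (None, None) otherwise.
    """
    dates = sorted(d for d, code in records if is_pneumonia(code))
    if not dates:
        return (None, None)
    first = dates[0]
    return ("prevalent" if first < baseline else "incident", first)
```

Only the first matching record is used, mirroring the statement that all analyses are based on the first occurrence of a diagnosis.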
Control group
The entire study population of non-cases was used as controls in the statistical analyses (n = 102,639 for severe pneumonia and n = 92,073 for severe COVID-19, respectively). This choice of controls is consistent with the majority of publications examining risk factors for susceptibility to severe COVID-19 (e.g. Ho et al., 2020; Williamson et al., 2020). It allows us to address the question of whether an initially healthy person with a high value of a given biomarker is at an increased risk of eventually getting the disease outcome (severe pneumonia or COVID-19 hospitalisation) compared to people from the general population with low levels of the biomarker. This choice of controls also overcomes biases that may arise from analyses using confirmed mild infections as the control group, such as collider bias caused by non-random testing of the control group compared to the rest of the study population (Griffith et al., 2020).
Prevalent diseases
To examine the influence of prevalent diseases in the prospective analyses of severe pneumonia and severe COVID-19, we used the following: prevalent cardiovascular disease (ICD-10 codes I20–I25, I50, I60–I64, and G45), diabetes (E10–E14), lung cancer (C33–C34, D02.2, Z85.1), chronic obstructive pulmonary disease (COPD; J43–J44), liver diseases (K70–K77), renal failure (N17–N19), and dementia (F00–F03).
Statistical methods
Biomarker levels more than four interquartile ranges from the median were considered outliers and excluded. All 37 biomarkers were scaled to standard deviation (SD) units prior to analyses. For biomarker association testing with severe pneumonia and with severe COVID-19 (as separate outcomes), we used logistic regression models adjusted for age, sex, and assessment centre. 
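The two preprocessing steps stated above (excluding values more than four interquartile ranges from the median, then scaling to SD units) can be sketched as follows. This is a minimal illustration, not the study's implementation: the quartile estimator, the use of the sample standard deviation, and centring on the mean are assumptions, and the function names are hypothetical.

```python
import statistics


def exclude_outliers(values, n_iqr=4.0):
    """Drop values lying more than n_iqr interquartile ranges from the median."""
    med = statistics.median(values)
    # Quartiles from the standard library's default ("exclusive") method;
    # the estimator actually used in the study is not stated.
    q1, _, q3 = statistics.quantiles(values, n=4)
    limit = n_iqr * (q3 - q1)
    return [v for v in values if abs(v - med) <= limit]


def scale_to_sd_units(values):
    """Express values in SD units (centred here; centring is an assumption)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```

Centring does not change an odds ratio reported per 1-SD increment, so only the division by the SD matters for the association magnitudes.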
To examine the utility of multiple biomarkers in combination, we used a weighted sum of the biomarkers optimised for association with future risk of severe pneumonia; this multi-biomarker score was denoted as ‘infectious disease score’. To minimise the collinearity of the biomarkers, the multi-biomarker score was trained using logistic regression with least absolute shrinkage and selection operator (LASSO), which uses L1 regularisation: a penalty proportional to the sum of the absolute values of the coefficients. The multi-biomarker infectious disease score was trained using half of the study population with complete data available for the 37 clinically validated biomarkers (n = 52,573 and 1257 severe pneumonia events) using five-fold cross-validation to optimise the regularisation parameter λ. The remaining half of the study population was used in validating the performance of the biomarker score in relation to future risk for severe pneumonia. The multi-biomarker infectious disease score was subsequently tested for association with severe pneumonia and COVID-19 in logistic regression models adjusted for age, sex, and assessment centre. We further examined the effect of additional adjustment for body mass index (BMI), smoking status (never, former, current) and prevalent diseases. The associations were also examined by omitting individuals with prevalent diseases and stratified by age and sex. In the case of severe pneumonia, we further examined the association magnitudes according to follow-up time: we used severe pneumonia events occurring during 7–11 years after the blood sampling to mimic the decade-long lag from blood sampling to the COVID-19 pandemic, and severe pneumonia events occurring within the first 2 years to interpolate to the scenario of preventative COVID-19 screening carried out today. In both scenarios, the confined follow-up times were arbitrarily chosen to be as short as possible while ensuring sufficient numbers of events. 
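For reference, the L1-penalised logistic regression described above can be written in its standard textbook form. The indexing, the 1/n normalisation of the log-likelihood and the unpenalised intercept are assumptions of this sketch; the source does not state these details.

```latex
% y_k = 1 if participant k had a severe pneumonia event, 0 otherwise
% X_{ki} = SD-standardised level of biomarker i for participant k
\[
(\hat{\beta}_0, \hat{\beta}) =
\arg\min_{\beta_0,\,\beta}\;
-\frac{1}{n}\sum_{k=1}^{n}\Bigl[\, y_k\,\eta_k - \log\bigl(1 + e^{\eta_k}\bigr) \Bigr]
+ \lambda \sum_{i=1}^{37} \lvert \beta_i \rvert,
\qquad
\eta_k = \beta_0 + \sum_{i=1}^{37} \beta_i X_{ki}
\]
% The score is the weighted sum over biomarkers with non-zero coefficients
\[
\mathrm{score}_k = \sum_{i \,:\, \hat{\beta}_i \neq 0} \hat{\beta}_i X_{ki}
\]
```

Larger values of λ shrink more coefficients exactly to zero, which is how the penalty performs biomarker selection; the value of λ is the quantity chosen by five-fold cross-validation in the derivation half of the study population.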
Finally, to explore potential non-linear effects, we plotted the proportion of individuals who contracted severe pneumonia during follow-up within each percentile bin of the infectious disease score (Khera et al., 2018). The time-resolution was further examined by Kaplan-Meier curves of the cumulative risk for severe pneumonia.
Acknowledgements
The authors are grateful to UK Biobank for access to data to undertake this study (Project #30418).
Additional information
Competing interests
Heli Julkunen: HJ is an employee of and holds stock options with Nightingale Health Plc. Anna Cichońska: AC is an employee of and holds stock options with Nightingale Health Plc. Peter Würtz: PW is an employee and shareholder of Nightingale Health Plc. The other author declares that no competing interests exist.
Funding
Funder | Author
Nightingale Health Plc | Heli Julkunen, Anna Cichońska, Peter Würtz
This work, including data collection, statistical analysis and writing of the paper, was done by employees of Nightingale Health Plc.
Author contributions
Heli Julkunen, Conceptualization, Data curation, Software, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing; Anna Cichońska, Conceptualization, Data curation, Software, Formal analysis, Supervision, Investigation, Visualization, Methodology, Writing - original draft, Project administration; P Eline Slagboom, Investigation, Methodology, Writing - original draft, Writing - review and editing; Peter Würtz, Conceptualization, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
Author ORCIDs
Heli Julkunen https://orcid.org/0000-0002-4282-0248
Peter Würtz https://orcid.org/0000-0002-5832-0221
Ethics
Human subjects: The UK Biobank recruited 502,639 participants aged 37–70 years in 22 assessment centres across the UK. All participants provided written informed consent and ethical approval was obtained from the North West Multi-Center Research Ethics Committee. Details of the design of the UK Biobank have been reported previously (Sudlow et al., PLOS Medicine 2015). The current analysis was approved under UK Biobank Project 30418.
Decision letter and Author response
Decision letter https://doi.org/10.7554/eLife.63033.sa1
Author response https://doi.org/10.7554/eLife.63033.sa2
Additional files
Supplementary files
Supplementary file 1. Table of weights of the biomarkers included in the multi-biomarker infectious disease score derived using LASSO regression. The table indicates the weights for the 25 biomarkers that were selected in derivation of the multi-biomarker infectious disease score, based on optimising prediction for severe pneumonia using logistic regression with LASSO in the derivation half of the study population (n = 52,573). 
Each biomarker was scaled to SD units prior to the analyses. The infectious disease score was then calculated as $\beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{25} X_{25}$, with $X_i$ denoting the SD-standardised biomarker level for the $i$th biomarker and $\beta_i$ denoting the coefficient from the multi-biomarker logistic regression model. DHA indicates docosahexaenoic acid; MUFA: monounsaturated fatty acids; PUFA: polyunsaturated fatty acids; SFA: saturated fatty acids.
Supplementary file 2. Mean biomarker concentrations and standard deviations, and odds ratios of all 249 biomarkers with severe pneumonia. The table indicates mean concentrations and standard deviations used for biomarker scaling. The table also includes numerical results of odds ratios of all 249 biomarkers with severe pneumonia, with corresponding 95% confidence intervals and p-values, and whether each biomarker is clinically validated and included in the multi-biomarker infectious disease score.
Transparent reporting form
Data availability
The data are available for approved researchers from UK Biobank. The metabolic biomarker data were released to the UK Biobank resource in March 2021.
The following dataset was generated:
Author(s) | Year | Dataset title | Dataset URL | Database and Identifier
Julkunen H, Cichońska A, Würtz P | 2021 | UK Biobank Nightingale biomarker data | https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220 | Biobank, 220
References
Ahola-Olli AV, Mustelin L, Kalimeri M, Kettunen J, Jokelainen J, Auvinen J, Puukka K, Havulinna AS, Lehtimäki T, Kähönen M, Juonala M, Keinänen-Kiukaanniemi S, Salomaa V, Perola M, Järvelin MR, Ala-Korpela M, Raitakari O, Würtz P. 2019. Circulating metabolites and the risk of type 2 diabetes: a prospective study of 11,896 young adults from four finnish cohorts. Diabetologia 62:2298–2309. 
DOI: https://doi.org/10.1007/s00125-019-05001w, PMID: 31584131 Akbar AN, Gilroy DW. 2020. Aging immunity may exacerbate COVID-19. Science 369:256–257. DOI: https://doi. org/10.1126/science.abb0762, PMID: 32675364 Almirall J, Serra-Prat M, Bolı́bar I, Balasso V. 2017. Risk factors for Community-Acquired pneumonia in adults: a systematic review of observational studies. Respiration 94:299–311. DOI: https://doi.org/10.1159/000479089, PMID: 28738364 Atkins JL, Masoli JAH, Delgado J, Pilling LC, Kuo C-L, Kuchel GA, Melzer D. 2020. Preexisting comorbidities predicting COVID-19 and mortality in the UK biobank community cohort. The Journals of Gerontology: Series A 75:2224–2230. DOI: https://doi.org/10.1093/gerona/glaa183 Bonafè M, Prattichizzo F, Giuliani A, Storci G, Sabbatinelli J, Olivieri F. 2020. Inflamm-aging: why older men are the most susceptible to SARS-CoV-2 complicated outcomes. Cytokine & Growth Factor Reviews 53:33–37. DOI: https://doi.org/10.1016/j.cytogfr.2020.04.005, PMID: 32389499 Deelen J, Kettunen J, Fischer K, van der Spek A, Trompet S, Kastenmüller G, Boyd A, Zierer J, van den Akker EB, Ala-Korpela M, Amin N, Demirkan A, Ghanbari M, van Heemst D, Ikram MA, van Klinken JB, Mooijaart SP, Peters A, Salomaa V, Sattar N, et al. 2019. A metabolic profile of all-cause mortality risk identified in an observational study of 44,168 individuals. Nature Communications 10:3346. DOI: https://doi.org/10.1038/ s41467-019-11311-9, PMID: 31431621 Dierckx T, van Elslande J, Salmela H. 2020. The metabolic fingerprint of COVID-19 severity. medRxiv. DOI: https://doi.org/10.1101/2020.11.09.20228221 Fry A, Littlejohns TJ, Sudlow C, Doherty N, Adamska L, Sprosen T, Collins R, Allen NE. 2017. Comparison of Sociodemographic and Health-Related characteristics of UK biobank participants with those of the general population. American Journal of Epidemiology 186:1026–1034. 
DOI: https://doi.org/10.1093/aje/kwx246, PMID: 28641372 Griffith GJ, Morris TT, Tudball MJ, Herbert A, Mancano G, Pike L, Sharp GC, Sterne J, Palmer TM, Davey Smith G, Tilling K, Zuccolo L, Davies NM, Hemani G. 2020. Collider Bias undermines our understanding of COVID-19 disease risk and severity. Nature Communications 11:5749. DOI: https://doi.org/10.1038/s41467-020-19478-2, PMID: 33184277 Ho FK, Celis-Morales CA, Gray SR, Katikireddi SV, Niedzwiedz CL, Hastie C, Ferguson LD, Berry C, Mackay DF, Gill JM, Pell JP, Sattar N, Welsh P. 2020. Modifiable and non-modifiable risk factors for COVID-19, and comparison to risk factors for influenza and pneumonia: results from a UK biobank prospective cohort study. BMJ Open 10:e040402. DOI: https://doi.org/10.1136/bmjopen-2020-040402, PMID: 33444201 Holmes MV, Millwood IY, Kartsonaki C, Hill MR, Bennett DA, Boxall R, Guo Y, Xu X, Bian Z, Hu R, Walters RG, Chen J, Ala-Korpela M, Parish S, Clarke RJ, Peto R, Collins R, Li L, Chen Z, China Kadoorie Biobank Collaborative Group. 2018. Lipids, lipoproteins, and metabolites and Risk of Myocardial Infarction and Stroke. Journal of the American College of Cardiology 71:620–632. DOI: https://doi.org/10.1016/j.jacc.2017.12.006, PMID: 29420958 Julkunen H, Cichońska A, Nightingale Health UK Biobank Initiative. 2020. Blood biomarker score identifies individuals at high risk for severe COVID-19 a decade prior to diagnosis: metabolic profiling of 105,000 adults in the UK biobank. medRxiv. DOI: https://doi.org/10.1101/2020.07.02.20143685 Kermali M, Khalsa RK, Pillai K, Ismail Z, Harky A. 2020. The role of biomarkers in diagnosis of COVID-19 - A systematic review. Life Sciences 254:117788. DOI: https://doi.org/10.1016/j.lfs.2020.117788, PMID: 32475810 Keyes KM, Westreich D. 2019. UK Biobank, big data, and the consequences of non-representativeness. The Lancet 393:1297. 
DOI: https://doi.org/10.1016/S0140-6736(18)33067-8 Khera AV, Chaffin M, Aragam KG, Haas ME, Roselli C, Choi SH, Natarajan P, Lander ES, Lubitz SA, Ellinor PT, Kathiresan S. 2018. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics 50:1219–1224. DOI: https://doi.org/10.1038/s41588-018-0183-z, PMID: 30104762 Mangalmurti N, Hunter CA. 2020. Cytokine storms: understanding COVID-19. Immunity 53:19–25. DOI: https://doi.org/10.1016/j.immuni.2020.06.017, PMID: 32610079 Messner CB, Demichev V, Wendisch D, Michalick L, White M, Freiwald A, Textoris-Taube K, Vernardis SI, Egger AS, Kreidl M, Ludwig D, Kilian C, Agostini F, Zelezniak A, Thibeault C, Pfeiffer M, Hippenstiel S, Hocke A, von Kalle C, Campbell A, et al. 2020. Ultra-High-Throughput clinical proteomics reveals classifiers of COVID-19 infection. Cell Systems 11:11–24. DOI: https://doi.org/10.1016/j.cels.2020.05.012, PMID: 32619549 Resource UKBiobankD. 2020. COVID-19 test results data. http://biobank.ctsu.ox.ac.uk/crystal/exinfo.cgi?src=COVID19_tests; [Accessed February 3, 2021]. Ritchie SC, Würtz P, Nath AP, Abraham G, Havulinna AS, Fearnley LG, Sarin AP, Kangas AJ, Soininen P, Aalto K, Seppälä I, Raitoharju E, Salmi M, Maksimow M, Männistö S, Kähönen M, Juonala M, Ripatti S, Lehtimäki T, Jalkanen S, et al. 2015. The biomarker GlycA is associated with chronic inflammation and predicts Long-Term risk of severe infection. Cell Systems 1:293–301. DOI: https://doi.org/10.1016/j.cels.2015.09.007, PMID: 27136058 Shen B, Yi X, Sun Y, Bi X, Du J, Zhang C, Quan S, Zhang F, Sun R, Qian L, Ge W, Liu W, Liang S, Chen H, Zhang Y, Li J, Xu J, He Z, Chen B, Wang J, et al. 2020. Proteomic and metabolomic characterization of COVID-19 patient sera. Cell 182:59–72. 
DOI: https://doi.org/10.1016/j.cell.2020.05.032 Soininen P, Kangas AJ, Würtz P, Suna T, Ala-Korpela M. 2015. Quantitative serum nuclear magnetic resonance metabolomics in cardiovascular epidemiology and genetics. Circulation: Cardiovascular Genetics 8:192–206. DOI: https://doi.org/10.1161/CIRCGENETICS.114.000216, PMID: 25691689 Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, Liu B, Matthews P, Ong G, Pell J, Silman A, Young A, Sprosen T, Peakman T, Collins R. 2015. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Medicine 12:e1001779. DOI: https://doi.org/10.1371/journal.pmed.1001779, PMID: 25826379 Williamson EJ, Walker AJ, Bhaskaran K. 2020. OpenSAFELY: factors associated with COVID-19 death in 17 million patients. Nature 584:430–436. DOI: https://doi.org/10.1038/s41586-020-2521-4 Würtz P, Kangas AJ, Soininen P, Lawlor DA, Davey Smith G, Ala-Korpela M. 2017. Quantitative Serum Nuclear Magnetic Resonance Metabolomics in Large-Scale Epidemiology: A Primer on -Omic Technologies. American Journal of Epidemiology 186:1084–1096. DOI: https://doi.org/10.1093/aje/kwx016, PMID: 29106475 Zhou F, Yu T, Du R, Fan G, Liu Y, Liu Z, Xiang J, Wang Y, Song B, Gu X, Guan L, Wei Y, Li H, Wu X, Xu J, Tu S, Zhang Y, Chen H, Cao B. 2020. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. The Lancet 395:1054–1062. DOI: https://doi.org/10.1016/ S0140-6736(20)30566-3, PMID: 32171076 Julkunen et al. eLife 2021;10:e63033. DOI: https://doi.org/10.7554/eLife.63033 20 of 20 Publication III Heli Julkunen, Anna Cichońska, Mika Tiainen, Harri Koskela, Kristian Nybo, Valtteri Mäkelä, Jussi Nokso-Koivisto, Kati Kristiansson, Markus Perola, Veikko Salomaa, Pekka Jousilahti, Annamari Lundqvist, Antti J. Kangas, Pasi Soininen, Jeffrey C. Barrett, Peter Würtz. 
Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nature Communications, February 2023. © 2023 The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Article https://doi.org/10.1038/s41467-023-36231-7
Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank
Received: 28 June 2022
Accepted: 20 January 2023
Heli Julkunen 1, Anna Cichońska 1, Mika Tiainen 1, Harri Koskela 1, Kristian Nybo 1, Valtteri Mäkelä 1, Jussi Nokso-Koivisto 1, Kati Kristiansson 2, Markus Perola 2, Veikko Salomaa 2, Pekka Jousilahti 2, Annamari Lundqvist 2, Antti J. Kangas 1, Pasi Soininen 1, Jeffrey C. Barrett 1 & Peter Würtz 1
Blood lipids and metabolites are markers of current health and future disease risk. Here, we describe plasma nuclear magnetic resonance (NMR) biomarker data for 118,461 participants in the UK Biobank. The biomarkers cover 249 measures of lipoprotein lipids, fatty acids, and small molecules such as amino acids, ketones, and glycolysis metabolites. We provide an atlas of associations of these biomarkers to prevalence, incidence, and mortality of over 700 common diseases (nightingalehealth.com/atlas). The results reveal a plethora of biomarker associations, including susceptibility to infectious diseases and risk of various cancers, joint disorders, and mental health outcomes, indicating that abundant circulating lipids and metabolites are risk markers beyond cardiometabolic diseases. Clustering analyses indicate similar biomarker association patterns across different disease types, suggesting latent systemic connectivity in the susceptibility to a diverse set of diseases. 
This work highlights the value of NMR-based metabolic biomarker profiling in large biobanks for public health research and translation.
UK Biobank is a prospective study of ~500,000 individuals who have volunteered to have their health information shared with scientists across the globe to advance public health research. This open resource is unique in its size and availability of extensive phenotypic and genomic data [1–3]. A selection of 30 routine blood biomarkers has previously been measured in the full cohort [4,5], but there is a unique opportunity to evaluate the public health relevance of a wider range of biomarkers and accelerate translation, as exemplified by genome-wide genotyping for population-based risk identification [6]. Here, we describe detailed metabolic biomarkers quantified by nuclear magnetic resonance (NMR) spectroscopy of 118,461 baseline plasma samples, generated by Nightingale Health Plc (Fig. 1a). The sample size is more than ten-fold larger than many of the largest metabolic profiling studies conducted to date [7,8]. The NMR biomarker panel comprises 249 measures of lipids and metabolites (Fig. 1b). These data are now available to approved researchers through the UK Biobank Showcase for all aspects of public health research. Many studies are already using these biomarker data, spanning applications related to, for instance, risk prediction, causal analyses, genetic discovery and drug target validation [9–18]. In this study, we present a comprehensive atlas of biomarker-disease associations (available at nightingalehealth.com/atlas), systematically examined across the 249 metabolic measures in relation to presence, future onset and mortality of over 700 disease outcomes (Fig. 1c). We illustrate the use of the atlas for biomarker discovery and identification of connections between overall biomarker signatures for various diseases. 
We replicate the findings in over 30,000 individuals from five prospective cohorts in the Finnish Institute for Health and Welfare (THL) Biobank profiled using the same NMR platform. Our biomarker-disease atlas may serve as a starting point to move from biomarker discovery to more detailed analyses in biological and clinical context.
1 Nightingale Health Plc, Helsinki, Finland. 2 Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland. e-mail: Heli.julkunen@nightingalehealth.com; Peter.wurtz@nightingalehealth.com
Fig. 1 | Nuclear magnetic resonance (NMR) biomarker data in the UK Biobank and atlas of disease associations. a Process of the Nightingale Health-UK Biobank Initiative: 1) EDTA plasma samples from the baseline survey were prepared on 96-well plates and shipped to Nightingale Health laboratories in Finland, 2) Buffer was added and samples transferred to NMR tubes, 3) Samples were measured using six 500 MHz proton NMR spectrometers, 4) Automated spectral processing software was used to quantify 249 biomarker measures from each sample, 5) Quality control metrics based on blind duplicates and internal control samples were used to track consistency metrics throughout the project, 6) Biomarker data were cleaned, provided to UK Biobank and released to the research community. b Overview of biomarker types included in the Nightingale Health NMR biomarker panel. c Schematic illustration of the atlas of biomarker-disease associations published along with this study. The webtool can display the associations of all biomarkers versus prevalence, incidence and mortality of each disease endpoint, as well as each biomarker versus all disease endpoints.
Table 1 | Characteristics of the UK Biobank participants with plasma NMR biomarkers in the first data release from Nightingale Health
Characteristic | Subset with NMR biomarkers, baseline | Full cohort, baseline
Number of participants | 118,461 | 502,543
Age at blood sampling (median, [range]) | 58 [39–71] | 58 [37–73]
Females (%) | 54 | 54
Body mass index (kg/m2), mean | 27.4 | 27.4
Smoking prevalence (regular, occasional; %) | 7.9, 2.7 | 7.8, 2.7
Fasting time, mean (h) | 3.8 | 3.8
Self-reported cholesterol-lowering medication use (%) | 18 | 17
Results
Plasma biomarker profiling by NMR
We measured lipid and metabolite biomarkers from 118,461 baseline plasma samples using the Nightingale Health NMR platform (Fig. 1a) [7,9,19]. Table 1 shows the characteristics of the participants with NMR biomarker data currently available in the UK Biobank. The EDTA plasma samples were picked randomly and are therefore representative of the 502,543 participants in the full cohort. Samples were generally drawn non-fasting, with an average of 4 hours since the last meal. The data release also contains biomarker measurements of ~4000 repeat visit samples collected on average four years after the baseline, with ~1500 participants having biomarker data from both baseline and the repeat-visit survey. The Nightingale Health NMR biomarker platform quantifies 249 metabolic measures from each sample in a single experimental assay, comprising 168 measures in absolute levels and 81 ratio measures (Fig. 1b). The biomarkers include measures already routinely used in clinical practice, such as cholesterol, as well as many emerging biomarkers increasingly measured in cohorts, such as omega-3 and other fatty acids [7,20]. The panel of biomarkers is based on feasibility for accurate quantification in a high-throughput manner, and therefore mostly reflects molecules with high circulating concentration. 
Most of the biomarkers relate to lipoprotein metabolism, with the lipid concentrations and composition measured in 14 lipoprotein subclasses in terms of triglycerides, phospholipids, total cholesterol, cholesterol esters, and free cholesterol, and total lipid concentration within each subclass. The panel additionally includes the absolute concentration and relative balance of the most abundant plasma fatty acids, such as saturated fatty acids, and small molecules, such as amino acids and ketone bodies. Apolipoproteins B and A1, and two inflammatory protein measures, albumin and glycoprotein acetyls, are also measured, owing to their high abundance in plasma.

Details of the NMR biomarker measurements of the UK Biobank samples are described in ‘Methods’. Key steps of the measurement process are illustrated in Supplementary Fig. 1 and an overview of all measured biomarkers is provided in Supplementary Fig. 2. The quality control protocol is described in Supplementary Methods and Supplementary Fig. 3. Coefficients of variation of the biomarkers are shown in Supplementary Fig. 4, and technical as well as biological variability is illustrated in Supplementary Fig. 5. Comparisons of the NMR biomarker measurements to routine clinical chemistry are illustrated in Supplementary Fig. 6 and to other multi-biomarker assays measured in smaller cohorts in Supplementary Figs. 7 and 8.

Atlas of biomarker–disease associations

The extensive electronic health records in the UK Biobank and the unprecedented sample size make it possible to study biomarker associations across the full spectrum of common diseases. We systematically computed the associations of the 249 NMR biomarkers with over 700 disease endpoints. Incident and mortality endpoints were defined by 3-character ICD-10 codes from nationwide hospital episode statistics and death records for diseases with at least 50 events occurring during 10 years after blood sampling.
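The incident endpoint definition above (3-character ICD-10 codes with at least 50 events) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the input layout of (participant, code) pairs, and the choice to count each participant once per code are assumptions made for illustration, and restriction to the 10-year follow-up window is assumed to have happened upstream.

```python
from collections import Counter

MIN_EVENTS = 50  # event threshold stated in the text


def three_char_code(icd10_code: str) -> str:
    """Truncate a full ICD-10 code such as 'I21.4' to its 3-character form 'I21'."""
    return icd10_code.replace(".", "").strip().upper()[:3]


def select_endpoints(events, min_events=MIN_EVENTS):
    """Return {3-character code: number of participants with an event}.

    `events` is an iterable of (participant_id, icd10_code) pairs that are
    assumed to be already restricted to the follow-up window (hypothetical
    input layout). Only codes reaching `min_events` are kept.
    """
    # Count each participant at most once per 3-character code.
    seen = {(pid, three_char_code(code)) for pid, code in events}
    counts = Counter(code for _, code in seen)
    return {code: n for code, n in counts.items() if n >= min_events}
```

With the default threshold, `select_endpoints(records)` keeps only codes that reach 50 participants; a smaller `min_events` is convenient for trying the function on toy input.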
Prevalent endpoints were defined for diseases with over 50 events in the hospital records during ~25 years before the blood sampling. Details of the data preprocessing and statistical modelling are described in Methods. We collated the results in the form of an online atlas of biomarker-disease associations available at nightingalehealth.com/atlas (Fig. 1c). The webtool can display interactive forestplots for all biomarkers with prevalence, incidence, and mortality of each disease endpoint, as well as disease-wide association plots for each of the 249 biomarkers.

We observed a total of 33,764 individual biomarker associations to incident disease endpoints at p < 5e-5 (Methods). Similarly, for 648 prevalent disease endpoints and 77 causes of death, 26,035 and 3,055 significant associations were identified, respectively. These biomarker associations were not concentrated in cardiometabolic diseases but spread across nearly all ICD-10 chapters. Examples include infectious diseases of both systemic and local character, certain cancers as well as mental and neurological disorders and musculoskeletal diseases. The magnitudes of biomarker associations for these diverse types of diseases were often similar to those of cardiovascular diseases. In the subsequent analyses in this paper, we focus on analyses of the future onset of diseases from ICD-10 chapters A-N and the 37 biomarkers from the Nightingale Health NMR platform certified for diagnostic use.

Biomarkers across the spectrum of diseases

Examining the NMR biomarkers across the spectrum of common diseases can provide insights into disease pathophysiology and specificity of the biomarkers. Fig. 2a illustrates the span of diseases in different ICD-10 chapters associated with the 37 clinically certified biomarkers. Many of the biomarkers exhibited associations across all types of diseases, with the exception of diseases of the eyes and the ears.
For example, monounsaturated fatty acids relative to total fatty acids (MUFA%) were associated with almost 200 different disease endpoints spanning all ICD-10 chapters A-N. Also, more established biomarkers such as omega-3% (i.e. concentration relative to total fatty acids) and routine cholesterol measures were associated with a wide spectrum of diseases. Glycolysis-related metabolites and amino acids displayed fewer associations, but still spanned more than endocrine and circulatory diseases.

Figure 2b–e shows the strongest incident disease associations in detail for four exemplar biomarkers; further examples are shown in Supplementary Fig. 9. The inflammatory biomarker glycoprotein acetyls, also known as GlycA, was associated with the risk of 32% of the incident disease endpoints examined (p < 5e-5), with a median hazard ratio of 1.26 per 1-SD increment in the biomarker concentration. The most significant associations were observed for gout, type 2 diabetes, smoking dependence, kidney diseases, chronic obstructive pulmonary disorder, myocardial infarction, pneumonia and anemias. Figure 2c highlights the strongest disease associations for the ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA), showing disease associations as widespread as those for GlycA. Similar results were observed also for other fatty acid measures, such as omega-3% and omega-6% as well as MUFA% (Supplementary Fig. 9a–c).

In contrast to this pattern of diverse associations, some biomarkers exhibited more distinct disease specificity. For instance, the amino acid alanine was primarily associated with the risk of diabetes and complications related to diabetes (Fig. 2d). Glycine and glutamine (Supplementary Fig. 9d, e) were also associated with diabetes-related complications, but additionally with the risk of liver and kidney diseases, with lower plasma concentrations indicating higher disease risk.
Glycine was also strongly associated with many circulatory disease endpoints, in line with the earlier suggested causal role of glycine levels in coronary heart disease21. Most of the biomarkers had a consistent direction of associations across different diseases, but not all. For example, higher branched-chain amino acid levels were associated with a higher risk for many metabolic diseases but a lower risk for a range of other diseases such as lung diseases, hernia and smoking dependence (Fig. 2e). A small number of biomarkers showed only weak magnitude of association across the spectrum of diseases, such as the ketone body 3-hydroxybutyrate (Supplementary Fig. 9f).

Fig. 2 | Biomarkers for future disease onset across a spectrum of diseases. a Total number of incident disease associations by biomarker at statistical significance level p < 5e-5. The disease outcomes were defined based on 3-character ICD-10 codes with 50 or more events from chapters A-N, with a total of 556 diseases tested for association. The colour coding indicates the proportion of associations coming from each ICD-10 chapter from A to N. b–e Twenty most significant associations for four biomarkers: b Glycoprotein acetyls, c Ratio of polyunsaturated fatty acids to monounsaturated fatty acids (PUFA/MUFA), d Alanine, and e Branched-chain amino acids (BCAA). The forestplots highlight 20 of the most significant associations, arranged according to decreasing association magnitude. Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. All models were adjusted for age, sex and UK biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Similar disease-wide association plots for all 249 biomarkers across all endpoints analysed are available in the biomarker-disease atlas webtool. Source data are provided as a Source Data file.

Considered from the disease perspective, Fig. 3 shows the biomarker association profiles for the incidence of six exemplar diseases. Multiple biomarkers are associated with incident hospitalisation for sleep disorders, depression, lung cancer and sepsis, with magnitudes of associations generally similar to those of myocardial infarction. The majority of the biomarkers associated exclusively in one direction of effect across these diseases and exhibited similar association patterns overall. An exception to this is osteoporosis, for which increased risk was characterised by decreased concentrations of branched-chain amino acids and triglycerides, and higher high-density lipoprotein cholesterol and apolipoprotein A1, in contrast to the other diseases in Fig. 3. All biomarker associations were robust to a sensitivity analysis excluding the first two years of follow-up, suggesting that they are not driven by clinically incipient cases at baseline (Supplementary Fig. 10).

Shared biomarker signatures for different diseases

Comparing biomarker signatures between diseases may help to understand molecular differences between conditions with similar pathophysiology and identify novel connections8,22. Figure 4a shows examples of clustering of diseases according to their overall biomarker association patterns. In the vertical direction, biomarkers such as GlycA and MUFA% cluster together due to their similarity in associations with many different types of diseases.
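The caption of Fig. 4 describes the dendrograms as complete linkage clustering based on the linear correlation between association signatures. A minimal pure-Python sketch of that idea is given below. It is illustrative only and not the authors' code: the use of 1 minus the Pearson correlation as the distance, the function names, and the input layout (a mapping from a label to its vector of log hazard ratios) are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Linear (Pearson) correlation between two equal-length sequences.

    Raises ZeroDivisionError for a constant vector, whose correlation
    is undefined.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def complete_linkage(signatures):
    """Agglomerative clustering with complete linkage on 1 - correlation.

    `signatures` maps a unique label (e.g. a disease) to its vector of
    log hazard ratios, one entry per biomarker. Returns the merge history
    as a list of (cluster_a, cluster_b, distance) tuples in merge order.
    """
    names = list(signatures)
    # Pairwise distances between individual signatures (assumed: 1 - r).
    dist = {
        frozenset((a, b)): 1.0 - pearson(signatures[a], signatures[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i, ca in enumerate(clusters):
            for cb in clusters[i + 1:]:
                # Complete linkage: the distance between two clusters is
                # the largest pairwise distance between their members.
                d = max(dist[frozenset((a, b))] for a in ca for b in cb)
                if best is None or d < best[2]:
                    best = (ca, cb, d)
        ca, cb, d = best
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca + cb]
        merges.append((ca, cb, d))
    return merges
```

Applied to the transposed matrix (one vector of log hazard ratios per biomarker across diseases), the same function would give the biomarker-wise dendrogram referred to as the vertical direction.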
[Figure 3: forest plots for six diseases (M81 Osteoporosis, I21 Myocardial infarction, G47 Sleep disorders, F32 Depression, C34 Lung cancer, A41 Sepsis). Biomarker groups: inflammation, amino acids, aromatic amino acids, branched-chain amino acids, apolipoproteins, fatty acid ratios, fatty acids, fluid balance, glycolysis related metabolites, cholesterol, triglycerides. x-axis: hazard ratio (95% CI), per 1-SD increment.]

Fig. 3 | Biomarker profiles for the incidence of various types of diseases. Hazard ratios of biomarkers with the incidence of six disease examples: A41 Sepsis (red; n = 117,806, 2986 events), C34 Lung cancer (light blue; n = 117,964, 1210 events), F32 Depression (green; n = 116,993, 5455 events), G47 Sleep disorders (dark blue; n = 117,325, 1865 events), I21 Myocardial infarction (orange; n = 116,797, 2523 events) and M81 Osteoporosis (lavender; n = 117,538, 3326 events). Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. The models were adjusted for age, sex and UK biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Filled points indicate statistically significant associations (p < 5e-5), and hollow points are non-significant ones. Similar forest plots for all 249 NMR biomarkers across all endpoints analysed are provided in the biomarker-disease atlas webtool. BCAA indicates branched-chain amino acids, DHA docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty acids. Source data are provided as a Source Data file.

Most amino acids cluster together, but glycine and histidine have deviating associations more similar to those of omega-6% and omega-3%, respectively. In the horizontal direction, the clustering analysis reveals both well-known connections between diseases and less anticipated similarities. For example, diabetes has highly similar biomarker association patterns with several of its complications, including polyneuropathies and retinal disorders. Common diseases of an infectious origin, pneumonia and general bacterial infection, also cluster together in terms of their overall biomarker association patterns, as do COPD and lung cancer. Some of the less well-known connections include, for instance, liver diseases and polyneuropathies, which had almost identical overall biomarker associations, as further highlighted in Fig. 4b.

The biomarker signatures were similar for many diseases, but notable differences may still be observed for diseases of similar pathophysiological origin20. Figure 4c illustrates how acute myocardial infarction and hospitalisation for heart failure have many deviating biomarker associations even though these two endpoints are often combined for clinical trial analyses in the five-point major adverse cardiovascular event (MACE) definition. Supplementary Figs. 11–13 further illustrate similarities and differences in the biomarker signatures for various other types of cardiovascular diseases. The biomarker association pattern differed for different types of myocardial infarction, angina, chronic ischaemic heart disease, and different types of stroke. Even more pronounced differences were observed when compared to heart failure and peripheral artery disease.
In particular, many biomarker associations appeared to be stronger for other circulatory endpoints than for myocardial infarction and ischaemic stroke. These results may suggest potential benefits for risk prediction separately for these types of cardiovascular events.

Fig. 4 | Clustering of incident diseases according to their biomarker signatures. a Heatmap showing the clustering of biomarker association signatures for the incidence of a diverse set of diseases. The diseases represent three diseases from each ICD-10 chapter from A to N, selected based on the highest number of significant associations. The colouring indicates the association magnitudes in units of the effect sizes, i.e. log(hazard ratio per SD). The dendrograms depict the similarity of the association patterns, computed using complete linkage clustering based on the linear correlation between the association signatures. Significant associations with p value < 5e-5 are marked with an asterisk. All models were adjusted for age, sex and UK biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Examples of overall biomarker signatures compared for incidence of b Other diseases of liver (K76) and Other polyneuropathies (G62), and c Acute myocardial infarction (I21) and Heart failure (I50). The hazard ratios for each biomarker are shown as points with 95% confidence intervals (CI) indicated in vertical and horizontal error bars. The colouring of the points indicates the significance of the biomarker association for the pair of diseases. The red lines denote a hazard ratio of 1, and the grey line denotes the diagonal. BCAA indicates branched-chain amino acids, DHA docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty acids. Source data are provided as a Source Data file.

Replication of biomarker signatures

Replication is essential in biomarker studies, no matter the sample size of the discovery analyses. We, therefore, sought to replicate the NMR biomarker associations in the UK Biobank in two ways: first by comparing the results to biomarkers measured by independent laboratory assays from the same UK Biobank samples, and second by analysing NMR biomarker data for over 30,000 participants from the Finnish Institute for Health and Welfare Biobank (THL biobank). Figure 5 shows the high concordance between disease associations for the eight biomarkers that have been measured by both NMR and clinical chemistry. The associations always have the same direction, and the hazard ratios are sometimes stronger for one assay and sometimes another, suggesting neither is systematically better at capturing disease association. Small deviations in the results may be because the plasma samples used for the NMR measurements were more affected by a known sample dilution issue than the corresponding serum samples used for clinical chemistry4. The consistency between the NMR-based and clinical chemistry assays in absolute concentrations is illustrated in Supplementary Fig. 7 and further discussed in Methods.

[Figure 5: forest plots for six diseases (A41 Sepsis, C34 Lung cancer, F32 Depression, G47 Sleep disorders, I21 Myocardial infarction, M81 Osteoporosis). Biomarkers: Total-C, Clinical LDL-C, HDL-C, Total triglycerides, ApoB, ApoA1, Albumin, Glucose. Legend: NMR, clinical chemistry. x-axis: hazard ratio (95% CI).]

Fig. 5 | Comparison of nuclear magnetic resonance (NMR) and clinical chemistry biomarker associations. Hazard ratios of biomarkers for which both NMR-based (red) and clinical chemistry (blue) measurements are available, against the incidence of six disease examples: A41 Sepsis (n = 117,806, 2986 events), C34 Lung cancer (n = 117,964, 1210 events), F32 Depression (n = 116,993, 5455 events), G47 Sleep disorders (n = 117,325, 1865 events), I21 Myocardial infarction (n = 116,797, 2523 events) and M81 Osteoporosis (n = 117,538, 3326 events). Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. The models were adjusted for age, sex and UK biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Filled points indicate statistically significant (p < 5e-5) associations, hollow points non-significant ones. Source data are provided as a Source Data file.

We note that low-density lipoprotein (LDL) cholesterol and apolipoprotein B displayed inverse associations across a wide range of diseases, i.e. higher concentration was associated with lower risk for disease incidence (Fig. 5). This observation, which is surprising compared to the existing literature on LDL as a risk factor for heart disease, is seen in both the NMR and clinical chemistry measurements, indicating that it stems from characteristics of the UK Biobank study rather than any property of the NMR measurements. This observation was mainly explained by widespread use of lipid-lowering medications in the case of cardiovascular endpoints, since the inverse lipid associations were attenuated or inverted in direction when individuals on lipid-lowering medication were excluded (Supplementary Fig. 14). Nonetheless, for most non-circulatory diseases, including five of the six disease examples shown in Fig. 3, the LDL cholesterol associations remained inverse even after excluding individuals on cholesterol-lowering medication (Supplementary Fig. 15), warranting further investigation in other cohorts.
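The concordance statement above, that NMR and clinical chemistry associations always share the same direction, amounts to checking that paired hazard ratios fall on the same side of 1. A minimal sketch of such a check follows; the function names and the dictionary layout keyed by biomarker name are hypothetical, and the actual hazard ratios are in the paper's Source Data file rather than reproduced here.

```python
from math import log


def same_direction(hr_a: float, hr_b: float) -> bool:
    """True when two hazard ratios fall on the same side of 1.

    A hazard ratio of exactly 1 has no direction and counts as discordant.
    """
    return log(hr_a) * log(hr_b) > 0


def direction_concordance(nmr, clinical):
    """Fraction of shared biomarkers whose hazard ratios agree in direction.

    `nmr` and `clinical` map biomarker names to hazard ratios for one
    disease endpoint (hypothetical layout). Only biomarkers present in
    both assays are compared.
    """
    shared = sorted(set(nmr) & set(clinical))
    if not shared:
        raise ValueError("no biomarkers in common")
    agree = [b for b in shared if same_direction(nmr[b], clinical[b])]
    return len(agree) / len(shared)
```

A returned value of 1.0 for every endpoint would correspond to the "always the same direction" observation; comparing magnitudes would additionally require the confidence intervals.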
We further replicated the associations observed in UK Biobank by a meta-analysis of five independent population-based cohorts from Finland measured using the same NMR platform (Methods; clinical characteristics listed in Supplementary Table 1). Figure 6 illustrates the consistency of the biomarker association signatures against all-cause mortality and five available incident disease outcomes. Replication results for the remaining available endpoints are shown in Supplementary Fig. 16. The biomarker associations were generally consistent in the two biobanks, especially for amino acids and other polar metabolites, fatty acid ratios and the two inflammatory protein measures. The results for absolute fatty acid concentrations deviated between the two study populations, whereas the results for fatty acid measures scaled relative to total fatty acids were highly concordant. This may suggest that such ratio measures are more easily transferrable across sampling approaches. The biomarker associations were consistent in each of the five Finnish cohorts, although there was a tendency for stronger hazard ratios for the cohort with the shortest follow-up time (Supplementary Fig. 17). The greatest deviations were observed for the aforementioned LDL-related biomarkers, which displayed strong inverse associations for diabetes and major adverse cardiovascular event (MACE) in UK Biobank but flat or weakly positive associations in the Finnish cohorts. By excluding participants using cholesterol-lowering medication in the UK Biobank, the associations generally became more consistent (Fig. 6). However, many of the inverse associations for LDL cholesterol and related lipids also replicated in the Finnish cohorts, such as in the case of all-cause mortality and chronic kidney failure (Fig. 6a, e).

[Figure 6: six panels (a–f) showing hazard ratios (95% CI) by biomarker. Legend: THL biobank cohorts, meta-analysis; UK biobank; UK biobank, excluding individuals using cholesterol-lowering medication.]

Fig. 6 | Replication of biomarker associations with incident disease. Biomarker associations for six disease endpoints are shown for THL Biobank (red) and UK Biobank for the full study population (light blue) as well as for individuals without self-reported use of cholesterol-lowering medication (dark blue): a All-cause mortality, b Major adverse cardiovascular event, c Diabetes, d Chronic obstructive pulmonary disease (COPD), e Chronic kidney failure and f Liver diseases. Results from THL biobank were meta-analysed for five prospective Finnish cohorts (FINRISK 1997, 2002, 2007, and 2012, and Health 2000). Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. All models were adjusted for age and sex, using age as the timescale of the Cox proportional hazards regression. Analyses in the UK biobank were additionally adjusted for the UK biobank assessment centre. Filled points indicate statistically significant associations (p < 5e-5), and hollow points non-significant ones. Black horizontal line denotes a hazard ratio of 1. Event numbers for incident disease or mortality in the two biobanks are shown in Table 2. ICD-10 codes used for compiling the composite endpoints are listed in Supplementary Table 2. The replication results are shown here for six endpoints available in THL biobank; results for all overlapping endpoints are shown in Supplementary Fig. 16. Results are shown separately for each of the five Finnish cohorts in Supplementary Fig. 17. BCAA indicates branched-chain amino acids, DHA docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty acids. Source data are provided as a Source Data file.

Table 2 | Sample size and number of events for replication analyses

Endpoint | THL Biobank, number of events/N (%) | UK Biobank, number of events/N (%) | UK Biobank subset, number of events/N (%)
All-cause mortality | 3 928/34 019 (11.55%) | 7 802/117 868 (6.62%) | 5 219/97 212 (5.37%)
Chronic kidney failure | 328/33 982 (0.97%) | 4 254/117 550 (3.62%) | 2 270/97 074 (2.34%)
COPD | 732/33 736 (2.17%) | 4 404/117 141 (3.76%) | 2 885/96 811 (2.98%)
Liver diseases | 417/33 783 (1.23%) | 2 696/117 328 (2.3%) | 1 884/96 828 (1.95%)
MACE | 4 640/31 754 (14.61%) | 6 511/115 745 (5.63%) | 4 311/96 885 (4.45%)
Diabetes | 2 703/31 565 (8.56%) | 6 836/115 579 (5.91%) | 3 376/96 746 (3.49%)

UK Biobank subset represents the subset excluding individuals with self-reported use of cholesterol-lowering medication. COPD chronic obstructive pulmonary disease, MACE major adverse cardiovascular event.
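The THL Biobank results are described as a meta-analysis of five Finnish cohorts, with the pooling method left to Methods. Assuming a standard inverse-variance fixed-effect approach (an assumption, not confirmed by this section), pooling per-cohort log hazard ratios and converting back to a hazard ratio with a 95% confidence interval could be sketched as follows. The function name and input layout are hypothetical.

```python
from math import exp, log, sqrt

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def pool_fixed_effect(estimates):
    """Inverse-variance fixed-effect pooling of log hazard ratios.

    `estimates` is a list of (log_hr, standard_error) pairs, one per
    cohort. Returns (pooled_hr, ci_lower, ci_upper) on the hazard ratio
    scale.
    """
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled_log_hr = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    pooled_se = sqrt(1.0 / total)
    return (
        exp(pooled_log_hr),
        exp(pooled_log_hr - Z_95 * pooled_se),
        exp(pooled_log_hr + Z_95 * pooled_se),
    )
```

The same exponentiation step, exp(beta ± 1.96 × SE), is how a single Cox regression coefficient per 1-SD increment maps to the hazard ratios and confidence intervals shown in the forest plots.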
Age and lipid-lowering medication effects

Excluding individuals using lipid-lowering medication might introduce collider bias in the findings by selecting for healthier individuals. To provide more context for evaluating these results, we also replicated the results in the FINRISK 1997 cohort, which has a low prevalence of cholesterol-lowering medication use due to the cohort being sampled in 1997 (3.5% in the full cohort, 4.5% after matching age to UK biobank). The results are shown in Supplementary Fig. 18, with analyses matched to the age range of UK Biobank participants. Most of the biomarker associations were consistent in this comparison and the aforementioned inverse and weak associations for LDL-related lipids observed in the UK biobank were also seen in the FINRISK 1997 cohort that is much less affected by cholesterol-lowering medication. This includes, for instance, the null association of LDL cholesterol with MACE and the inverse associations with all-cause mortality and chronic kidney failure. These results suggest that the observations made in UK Biobank after excluding cholesterol-lowering medication users are likely not primarily due to collider bias, but rather relate to the characteristics of the higher-aged individuals in UK Biobank.

To provide another angle on the influence of cholesterol-lowering and other medications on the biomarker associations, we stratified the biomarker analyses by age tertiles4. As the use of cholesterol-lowering and other medications increases with age, younger age groups are less prone to such sources of bias. Fig. 7 shows age-stratified biomarker associations for 17 biomarkers across the incidence of the six exemplary diseases from Fig. 3. Results for the remaining 20 biomarkers are shown in Supplementary Fig. 19. In many cases, the association magnitudes were stronger in the youngest age tertile.
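The age stratification described above can be sketched as a simple grouping step before fitting a separate model per stratum. The age boundaries below are the ones printed in the Fig. 7 legend (39–53, 54–61 and 62–71 years); the function names, the use of whole-year ages and the (participant, age) input layout are assumptions for illustration.

```python
# Age ranges in whole years, as printed in the Fig. 7 legend.
TERTILE_RANGES = {
    "1st tertile": (39, 53),
    "2nd tertile": (54, 61),
    "3rd tertile": (62, 71),
}


def age_tertile(age_years: int) -> str:
    """Return the tertile label for an age given in whole years."""
    for label, (low, high) in TERTILE_RANGES.items():
        if low <= age_years <= high:
            return label
    raise ValueError(f"age {age_years} is outside the 39-71 range")


def stratify(participants):
    """Group (participant_id, age_years) pairs by tertile label."""
    groups = {label: [] for label in TERTILE_RANGES}
    for participant_id, age_years in participants:
        groups[age_tertile(age_years)].append(participant_id)
    return groups
```

Each resulting group would then be analysed separately, so that hazard ratios can be compared between the youngest and oldest strata.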
In particular, notable differences were observed in the case of LDL-related biomarkers, for which the associations became weaker in the older tertiles against myocardial infarction and completely inverted direction against non-circulatory diseases, which can likely be at least partially attributed to the higher prevalence of statin use in the oldest age groups. Increased association magnitudes with younger age were also observed for biomarkers known to not be affected by lipid-lowering treatment23,24, including inflammatory protein biomarkers and several amino acids, suggesting that the effects cannot be entirely attributed to a lower prevalence of statin use among the younger individuals. Comparisons of the age-stratified association estimates across all endpoints analysed are available in the biomarker-disease atlas webtool.

Discussion

Detailed biomarker profiling is a key part of the promise of precision medicine initiatives to transform preventative healthcare. Blood biomarkers provide modifiable molecular measures which relate to future health outcomes and serve as intermediates between lifestyle factors and disease risk. This study describes the generation of NMR biomarker data by Nightingale Health in the UK Biobank, which is currently the world’s largest resource of metabolic biomarkers linked to health records. These data greatly extend the blood biomarker coverage in the UK Biobank and provide a wide span of molecular biomarkers not commonly measured in clinical practice, including amino acids, ketones and fatty acids. With over 118,000 plasma samples profiled in the UK Biobank, the addressable research questions extend vastly beyond biomarker discovery and the large sample size benefits, for example, causal analyses and risk prediction9,13,14,17.
Due to the streamlined data access policy in UK Biobank, the data release opens possibilities for the research community to use the entire epidemiological toolbox to study the NMR biomarkers in relation to public health.

The biomarkers in the Nightingale Health NMR platform are typically denoted ‘metabolic biomarkers’, and most prior studies on the data have focused on cardiometabolic diseases. Our analyses reveal that many of these biomarkers capture risk for many other diseases as well. This includes the future onset of diseases of the joints, bones, lungs, many different cancers as well as many mental disorders and severe infectious diseases. These results explain earlier reports on strong associations of the NMR biomarkers with all-cause mortality25, since many of the biomarkers are associated broadly with leading causes of morbidity and mortality. Widespread associations across different diseases are known for inflammatory biomarkers such as GlycA26,27, but this has not previously been shown for circulating fatty acids, amino acids or many detailed lipoprotein measures. For example, MUFA% was the biomarker associated across the highest number of endpoints and showed similar disease clustering as GlycA. Our results of widespread disease associations for many fatty acid ratios may suggest that these biomarkers should be considered as markers of systemic inflammation more so than of recent diet.

Plasma metabolites are increasingly understood to link to multimorbidities8,27. This is strongly reinforced by our discovery of biomarker associations with the full spectrum of common diseases. We observed that a broad range of diseases with different pathophysiology were characterised by similar biomarker association profiles. For example, severe infectious diseases had similar biomarker signatures to, for instance, chronic respiratory diseases as well as urinary and renal diseases.
A potential explanation may be that many of the biomarkers reflect the innate immune system’s ability to respond. This would help to explain why many of the biomarkers were associated with susceptibility to severe infectious diseases, such as hospitalisation and death from sepsis, fungal infections and pneumonia9. These observations illustrate how novel insights beyond individual diseases can be gained by studying overall biomarker signatures and numerous disease outcomes simultaneously. The genomic data in UK Biobank may help to elucidate causality of these results via Mendelian randomisation11,17. The striking similarity of the biomarker risk profiles across various diseases might pose challenges to certain clinical applications requiring high disease specificity. However, it is ideal when aiming to use the biomarker panel to assess the risk of multiple diseases and overall health status simultaneously based on a single measurement. This could potentially be used for individualised health assessment at scale to prioritise high-risk individuals for further examinations and guide preventative actions. In fact, a recently published study28 demonstrated the potential of the NMR biomarker profiles to predict multi-disease outcomes, showing predictive improvements over comprehensive clinical risk factors which were largely shown to translate into clinical utility. As such, this could have many applications in clinical settings and provide an attractive tool for multi-disease risk screening.

[Figure 7: forest plots in six panels (A41 Sepsis, C34 Lung cancer, F32 Depression, G47 Sleep disorders, I21 Myocardial infarction, M81 Osteoporosis). Biomarker groups: Cholesterol, Apolipoproteins, Triglycerides, Fatty acid ratios, Inflammation. Biomarkers shown: Clinical LDL-C, HDL-C, Total-C, VLDL-C, ApoA1, ApoB, ApoB/ApoA1, Total triglycerides, DHA %, Omega-3 %, Omega-6 %, Omega-6/Omega-3, MUFA %, PUFA %, PUFA/MUFA, SFA %, Glycoprotein acetyls. x-axis: hazard ratio (95% CI), per 1-SD increment. Legend: 1st tertile (39–53, statin use 6%), 2nd tertile (54–61, statin use 17%), 3rd tertile (62–71, statin use 30%).]

Fig. 7 | Age-stratified biomarker profiles for the onset of various types of diseases. Biomarker profiles stratified by age tertiles: 1st tertile (39–53 years of age; dark blue), 2nd tertile (54–61 years of age; red) and 3rd tertile (62–71 years of age; green). Results are shown for 17 biomarkers across six disease examples: A41 Sepsis (n = 117,806, 2986 events), C34 Lung cancer (n = 117,964, 1210 events), F32 Depression (n = 116,993, 5455 events), G47 Sleep disorders (n = 117,325, 1865 events), I21 Myocardial infarction (n = 116,797, 2523 events) and M81 Osteoporosis (n = 117,538, 3326 events). Results for the remaining 20 biomarkers are shown in Supplementary Fig. 19. Data are presented as hazard ratios and 95% confidence intervals (CI), per SD-scaled biomarker concentrations. The models were adjusted for age, sex and UK Biobank assessment centre, using age as the timescale of the Cox proportional hazards regression. Filled points indicate statistically significant associations (p < 5e-5), and hollow points non-significant ones. Similar forest plots for all 249 NMR biomarkers across all endpoints analysed are provided in the biomarker-disease atlas webtool. DHA indicates docosahexaenoic acid, MUFA monounsaturated fatty acids, PUFA polyunsaturated fatty acids, SFA saturated fatty acids. Source data are provided as a Source Data file.

Our biomarker-disease atlas published with this paper can be used to rapidly corroborate or refute many prior biomarker studies. For instance, we replicate the recent reports on higher branched-chain amino acid concentrations associated with lower risk for Alzheimer’s disease and dementia29. The event numbers for these neurodegenerative diseases in UK Biobank alone are similar to those in the eight meta-analysed cohorts. The biomarker-disease atlas may also be used to put into question other reported biomarker discoveries, such as branched-chain amino acids in relation to risk for pancreatic cancer30: the association was essentially flat in UK Biobank despite a similar number of events. These examples illustrate how the biomarker-disease atlas may speed up research and serve as a starting point for analyses that yield deeper aetiological insights and clinical context, much as widely available GWAS summary statistics transformed the interpretation of genetic studies. We note that the availability of the NMR biomarker data in UK Biobank does not diminish the relevance of having these data in smaller cohorts, both for replication and for complementary study designs. For example, the precise estimates of biomarker associations in UK Biobank can make analyses of smaller cohorts and trials more interpretable in relation to longitudinal sampling and intervention effects. Metabolic profiling of all 500,000 baseline plasma samples in UK Biobank is underway. This will greatly expand the possibilities for studying rarer diseases and prediction of short-term risk, as well as open possibilities for analyses focusing on individuals with prevalent disease and multi-morbidity trajectories.
Coupled with the rich genomic data, clinical chemistry and proteomics measures, imaging, complete health records, and other health-related data that are continually added to the UK Biobank resource, the NMR biomarker data will enhance the possibilities for scientific discovery and are set to yield important findings for public health and clinical use. The data are available to approved researchers through similar access protocols as existing UK Biobank data (http://ukbiobank.ac.uk/).

Methods

UK Biobank cohort

The UK Biobank study was approved by the North West Multi-Centre Research Ethics Committee and all participants provided written informed consent. The study protocol is available online (https://www.ukbiobank.ac.uk). The biomarker profiling of plasma samples by NMR spectroscopy was approved under UK Biobank Project 30418. The UK Biobank resource is a globally accessible biomedical database of half a million UK participants aged 40–69 years at baseline1. Baseline characteristics of the full cohort and the subset with available NMR biomarker data are provided in Table 1. A large variety of health information has been collected for each participant. For instance, the database includes questionnaire data on participants’ socio-economic and lifestyle factors, cognitive tests, imaging data, heart and lung function measures, and body size and composition measures. Extensive genomic data are available, with genotyping array and exome-sequencing data available for all participants, and whole-genome sequencing under way2. The UK Biobank blood sample collection was undertaken at baseline in 22 local assessment centres across the UK between 2007 and 2010. The blood sample handling and storage protocol has been previously described31. Prior to the measurement of the NMR biomarkers, 35 biomarkers had been measured from blood and urine samples by clinical chemistry4,5.
Plasma biomarker profiling by NMR

Nightingale Health Plc. is performing biomarker profiling of baseline plasma samples for all 500,000 participants in the UK Biobank. Details of the Nightingale Health NMR biomarker platform have been described previously7,19. The main steps in the experimental procedures are illustrated in Supplementary Fig. 1. The biomarker measurements took place in Finland between 2019 and 2020 using six NMR spectrometers. The first data release covers biomarker measurements from a random selection of 118,461 EDTA plasma samples from the baseline recruitment. In addition, around 4000 EDTA plasma samples from repeat assessments are included in the same data release, with both baseline and repeat-visit samples measured for ~1500 participants. The NMR biomarker dataset was made available to the research community through the UK Biobank in March 2021. All sample analysis processes were performed according to the standard operating procedures that are part of Nightingale Health’s EN ISO 13485 certified Quality Management System (certified by DEKRA Certification B.V.). Nightingale Health measured all plasma samples with a CE-marked In Vitro Diagnostic Medical Device. At the time of completion of UK Biobank phase 1 samples, 37 of the biomarkers in the panel were CE-marked and certified for diagnostics use. In order to facilitate translational applications and visualisation of the results, we focused on this set of 37 clinically validated biomarkers in the examples highlighted in the paper, as they span most of the different metabolic pathways measured by the NMR platform. Complete results for all 249 biomarkers measured are provided in the biomarker atlas webtool.

Plasma sample preparation. EDTA plasma samples from aliquot 3 were prepared in 96-well plates by UK Biobank laboratory (Stockport, UK).
At least 90 μL of plasma was aliquoted in each well using TECAN Freedom EVO 150 robotic liquid handlers, which have coefficients of variation (CV) in pipetting volume of <0.75% across 8 tips. The plasma samples were shipped to Nightingale Health laboratories in Finland in 96-well plates on dry ice in batches of 5000–20,000 samples. No selection criteria were applied to the sampling and the 118,461 samples are therefore a random subset of the full cohort. Samples were stored in a freezer at −80 °C at Nightingale Health laboratories after arrival from UK Biobank laboratory. Before preparation, frozen samples were slowly thawed at +4 °C overnight, and then mixed gently and centrifuged (3 min, 3400 × g, +4 °C) to remove possible precipitate. Aliquots of each sample were transferred into 3-mm outer-diameter NMR tubes and mixed in 1:1 ratio with a phosphate buffer (75 mM Na2HPO4 in 80%/20% H2O/D2O, pH 7.4, including also 0.08% sodium 3-(trimethylsilyl) propionate-2,2,3,3-d4 and 0.04% sodium azide) using an automated liquid handler (PerkinElmer Janus Automated Workstation).

NMR spectroscopy. The plasma samples were measured using six 500 MHz NMR spectrometers (Bruker AVANCE IIIHD). Measurements were conducted blinded prior to the linkage to the UK Biobank health outcomes. The prepared plasma samples on 96-well plates were loaded onto a cooled sample changer, which maintains the temperature of samples waiting to be measured at +6 °C. Two NMR spectra were recorded for each plasma sample. The first spectrum is a presaturated proton spectrum, which features resonances arising mainly from proteins and lipids within various lipoprotein particles. The second spectrum is a Carr-Purcell-Meiboom-Gill T2-relaxation-filtered spectrum where most of the broad macromolecule and lipoprotein lipid signals are suppressed, leading to enhanced detection of low-molecular-weight metabolites.

Quantified biomarkers.
The biomarkers were quantified using Nightingale Health’s proprietary software (quantification library 2020), which simultaneously quantifies 249 metabolic measures per EDTA plasma sample, comprising 168 absolute and 81 ratio measures (Supplementary Fig. 2). All the biomarkers are of known identity. The biomarker measures include routine lipids, lipoprotein subclass profiling with lipid concentrations within 14 subclasses, fatty acid composition, and various low-molecular-weight metabolites such as amino acids, ketone bodies and glycolysis metabolites quantified in molar concentration units. For the 14 lipoprotein subclasses, the lipid concentrations and composition are measured in terms of triglycerides, phospholipids, total cholesterol, cholesterol esters and free cholesterol, as well as total lipid concentration within each subclass. The majority of the biomarkers are measured in absolute concentration units (mmol/L). The 37 biomarkers in the panel which have been certified for diagnostics use (CE-marked) are marked by asterisks in Supplementary Fig. 2. The average biomarker detection rate was >99% across the plasma samples. The quality control protocol is described in Supplementary Methods and illustrated in Supplementary Fig. 3. The distribution of coefficients of variation of the biomarkers for UK Biobank’s blind duplicate samples as well as Nightingale Health’s internal control samples is shown in Supplementary Fig. 4. The coefficients of variation for each biomarker are given in the UK Biobank data resource (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220). This resource also contains distribution plots showing the consistency over consecutive shipment batches and in different NMR spectrometers, as well as scatter plots on the technical repeatability from blinded duplicate samples and the biological consistency in repeat-visit samples drawn from the same individuals four years apart.
These technical and biological repeatability assessments are illustrated with GlycA as an example in Supplementary Fig. 5. Supplementary Methods further contain notes about the quality flags for samples and biomarkers as well as general recommendations for data processing in relation to epidemiological analyses.

Plasma sample dilution issue. All UK Biobank blood samples are known to suffer from unintended dilution during the initial sample storage process at UK Biobank facilities. Prior reports have suggested that samples from aliquot 3, used for the NMR measurements, suffer from 5–10% dilution4. The dilution is believed to come from mixing of participant samples with water due to seals that failed to hold a system vacuum in the automated liquid handling systems. While this issue is likely to have an impact on some of the absolute biomarker concentration values, it is expected to have limited impact on most epidemiological analyses. However, we recommend that this aspect is considered when conducting analyses that rely on absolute concentrations, such as stratification based on biomarker concentration cutpoints. It may also make it challenging to compare distributions of biomarker concentrations with those observed in other cohort studies. We, therefore, caution against using the concentrations observed in UK Biobank as reference levels for translational applications.

Comparison to clinical chemistry. The consistency between lipids, apolipoproteins, creatinine, albumin and glucose measured by routine clinical chemistry and Nightingale Health NMR is illustrated in Supplementary Fig. 6. For these comparisons, it is important to note that the clinical chemistry in UK Biobank was measured from serum samples, primarily from aliquot 1, while the NMR biomarkers were measured from EDTA plasma samples from aliquot 3. The different aliquots are affected by different degrees of dilution, with aliquot 3 being 5–10% diluted while aliquot 1 has almost no dilution4.
Supplementary Fig. 6 therefore also shows the measurement consistency in the FinHealth 2017 study, without the dilution issue. This study is a population-based cohort under the Finnish Institute for Health and Welfare (THL) Biobank with n ~ 6000. In the FinHealth 2017 cohort, clinical chemistry assays were measured from frozen serum samples soon after the cohort survey and the NMR biomarkers one year later from frozen samples using the Nightingale Health platform on 350 μL aliquots of serum. Correlations between the clinical chemistry assays and NMR were high in both cohorts, but the overall consistency was weaker in UK Biobank compared with the FinHealth 2017 study. In particular, the absolute concentrations deviated more from the diagonal in UK Biobank than in the FinHealth 2017 study, owing to the sample dilution issue in UK Biobank. Other aspects contributing to mismatch in absolute concentrations in UK Biobank are subtle differences in biomarker levels between serum and EDTA plasma and differences in sample storage time. The consistency of the NMR biomarkers with clinical chemistry in the FinHealth 2017 study is in line with earlier studies that have reported correlation coefficients R > 0.92. A recent paper reported correlations of the same NMR biomarkers with clinical chemistry for FINRISK cohorts under the THL Biobank to be R~0.95 for the newest sample collection, and R~0.90 for the oldest sample collections20. Note that ‘Clinical LDL cholesterol’ is the NMR-based measure that provides concentrations consistent with clinical chemistry and the Friedewald equation for LDL-cholesterol. We further note that the correlation coefficient for albumin was weaker in the UK Biobank than observed for the other clinical chemistry measures. However, the associations of albumin with disease outcomes were broadly similar for both assays, as shown in Fig. 5.
Comparisons of the NMR biomarkers with overlapping biomarkers from commercial mass-spectrometry assays and gas chromatography fatty acid assays in smaller cohorts are described in Supplementary Methods, and scatter plots of the consistency are shown in Supplementary Figs. 7 and 8.

Disease outcome definitions

Prevalent, incident and mortality disease outcomes were derived from UK Hospital Episode Statistics data and national death registries. A diagnosis in a hospital or death record formed the basis of the disease endpoint definition. Primary care records were not used. Disease endpoints were defined based on the first occurrence of a 3-character ICD-10 code in the hospital inpatient and death register data (January 2021 update). To extend the follow-up time to the period prior to the introduction of ICD-10 in 1995, ICD-9 codes were mapped to the corresponding 3-character ICD-10 codes using general equivalence mappings from the Centers for Disease Control and Prevention (https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2018/). A prevalent event was defined as an event that occurred before the date of the participant’s baseline visit, when a blood sample was collected. Individuals with a corresponding prevalent event for each outcome were excluded from the analysis of incident disease, but not from analyses of mortality outcomes. Both primary and secondary diagnosis codes were considered when forming the endpoints. The follow-up of hospitalisations ended on November 30, 2020 in England, October 31, 2020 in Scotland, and February 28, 2018 in Wales. The follow-up of the death registry ended on November 30, 2020. We omitted disease outcomes with fewer than 50 cases from the analyses. This led to a total of 648 prevalent, 717 incident and 77 mortality outcomes for the study population with NMR biomarker data available. For the examples highlighted in this paper, we focused on 556 incident disease outcomes from ICD-10 chapters A-N.
The selection of chapters A-N excludes pregnancy-related outcomes, conditions originating in the perinatal period and congenital malformations, deformations and chromosomal abnormalities (chapters O-Q), as there were not enough incident events passing the criterion of at least 50 events to be included in the analyses. Chapters R-U (symptoms, signs and laboratory findings not elsewhere classified; injuries; accidents; factors influencing health status and contact with health care services; and codes for special purposes) were excluded to place the focus on common diseases.

Biomarker association analyses across all endpoints

For the disease association analyses, biomarker values outside four interquartile ranges from the median were considered outliers and excluded from the analyses. Furthermore, biomarker values were corrected for the NMR spectrometer used for the measurements by fitting a linear regression model with log1p-transformed concentrations as the outcome and spectrometer as the predictor. Scaled residuals from this regression were used as predictors in the association analyses. Log1p stands for the natural logarithm of 1 + x. We used Cox proportional hazards modelling to estimate associations between biomarkers and incident disease outcomes (hospitalisation or death) across all endpoints with 50 or more events. The models were adjusted for sex and UK Biobank assessment centre, using age as the time scale of the Cox proportional hazards regression. Associations for each biomarker-disease pair were computed separately. For biomarker association testing with prevalent diseases, we used logistic regression models adjusted for age, sex and assessment centre. Hazard ratios and odds ratios are reported per SD increment in the log1p-transformed biomarker concentrations in order to allow comparison of association magnitudes for measures with different units and concentration ranges.
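The biomarker preprocessing described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code. The function name `preprocess_biomarker`, the order of operations (outlier exclusion on the raw scale before the log1p transform), the quartile definition and the use of the sample standard deviation for scaling are all assumptions that the text does not specify. Because spectrometer is the only predictor, the linear regression reduces to subtracting the per-spectrometer mean.

```python
import math
import statistics
from collections import defaultdict


def preprocess_biomarker(values, spectrometers, n_iqr=4.0):
    """Hypothetical helper: return SD-scaled, spectrometer-adjusted log1p values.

    Samples flagged as outliers are returned as None.
    """
    # Step 1: exclude values more than n_iqr interquartile ranges from the median
    # (assumed here to be applied to the untransformed concentrations).
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    keep = [abs(v - median) <= n_iqr * iqr for v in values]

    # Step 2: log1p-transform the retained concentrations, i.e. ln(1 + x).
    logged = {i: math.log1p(v) for i, v in enumerate(values) if keep[i]}

    # Step 3: regress on spectrometer. With one categorical predictor, the
    # fitted value is the group mean, so residual = value - group mean.
    groups = defaultdict(list)
    for i, y in logged.items():
        groups[spectrometers[i]].append(y)
    group_mean = {s: statistics.fmean(ys) for s, ys in groups.items()}
    residuals = {i: y - group_mean[spectrometers[i]] for i, y in logged.items()}

    # Step 4: scale residuals to unit standard deviation so that effect sizes
    # are expressed per SD increment.
    sd = statistics.stdev(list(residuals.values()))
    return [residuals[i] / sd if keep[i] else None for i in range(len(values))]


# Toy data: two spectrometers with a level shift and one extreme value.
example_values = [1.0, 1.2, 0.9, 1.1, 1.3, 2.0, 2.2, 1.9, 2.1, 50.0]
example_spectrometers = ["A"] * 5 + ["B"] * 5
scaled = preprocess_biomarker(example_values, example_spectrometers)
```

In this toy input the last value lies far outside four interquartile ranges from the median and is returned as `None`; the remaining values are centred within each spectrometer and have unit standard deviation overall. The scaled values would then enter the Cox or logistic models as the biomarker predictor.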
Sex-specific analyses were conducted for 148 female-specific and 18 male-specific diseases (Supplementary Table 3). These association analyses were performed in a subset containing only the specific sex, using the same approach without the inclusion of sex as a covariate. We also performed analyses by stratifying the UK Biobank population into age tertiles (1st tertile 39–53 years of age, statin use 6%; 2nd tertile 54–61 years of age, statin use 17%; 3rd tertile 62–71 years of age, statin use 30%). In the biomarker-disease atlas, results are reported for all conducted analyses and the webtool allows filtering by a desired significance level. In this paper, we use a multiple testing-corrected significance level of 5 × 10^−5 for reporting statistically significant associations, i.e. correcting for 1000 independent tests to account for both high correlation between the NMR biomarkers (~50 independent tests7) and correlations between the disease endpoints analysed.

Clustering analyses

For clustering analyses, a dendrogram and heatmap were computed based on the association magnitudes of the 37 biomarkers with three diseases from each ICD-10 chapter from A to N. The diseases were selected based on the highest number of significant biomarker associations in each ICD-10 chapter. The 37 biomarkers selected are the ones clinically validated in the Nightingale Health NMR platform. Biomarkers are clustered in the dendrogram based on disease association profiles, and diseases are clustered based on biomarker profiles, using complete linkage clustering based on linear correlation between the association signatures.

Replication in additional cohorts

To replicate biomarker associations from the UK Biobank, we used data from five prospective population-based studies administered under the Finnish Institute for Health and Welfare (THL) Biobank: FINRISK 1997, FINRISK 2002, FINRISK 2007, FINRISK 2012 and Health 2000. Each cohort is an independent random sample drawn from people aged 25–98 (25–74 in FINRISK, 30 and over in Health 2000) in the Finnish population. Baseline characteristics of these cohorts are provided in Supplementary Table 1. The study participants are unique in each cohort. Baseline blood samples were collected for ~85% of all participants enrolled. Venous blood was drawn non-fasting, but with a recommended minimum of a 4-h fast. Biomarker profiling by the Nightingale Health NMR platform was conducted from frozen serum samples for all participants during 2018 (ref. 20). The cohort studies were approved by the Coordinating Ethical Committee of the Helsinki and Uusimaa Hospital District, Finland. Written informed consent was obtained from all participants.

Fourteen disease endpoints were used for replication analyses in THL Biobank, selected based on the outcome data made available to Nightingale Health Plc. The disease outcome definitions were predefined by THL Biobank based on a combination of national hospital and cause-of-death registries (Supplementary Table 2). The registry-based follow-up covers virtually all diseases leading to hospitalisation or death in Finland. Follow-up data for the present study were available until the end of 2016. For the replication analyses, we defined similar endpoints in UK Biobank based on the ICD-10 codes listed in Supplementary Table 2. The association analyses were for incident disease, so individuals with prevalent disease of the same endpoint were omitted. The hazard ratios were computed separately in each cohort using Cox proportional hazards regression adjusted for sex and using age as the time scale of the regression. Results from the individual cohorts were meta-analysed using inverse variance weighting. Similar to the analyses in UK Biobank, hazard ratios are reported in SD-scaled units.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The Nightingale Health NMR biomarker data were released to the UK Biobank resource in spring 2021 (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220). The UK Biobank data are available for approved researchers through the UK Biobank data-access protocol. NMR spectral data are not available as they are outside of the scope of the Nightingale-UK Biobank initiative. Instructions for the data access process, timeframe and restrictions imposed on the data are described at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. The average time from application submission to data release is 15 weeks for UK Biobank. Nightingale Health NMR biomarker data from FINRISK and Health 2000 cohorts, used for replication in this study, are available for approved researchers through THL Biobank. Instructions for the data access process are provided at https://thl.fi/en/web/thl-biobank/for-researchers/application-process. We provide access to all biomarker-disease summary statistics for non-commercial use through an interactive webtool https://nightingalehealth.com/atlas (CC BY-NC-ND 4.0 license). Source data are provided with this paper.

Code availability

Code used in this study is available at: https://github.com/NightingaleHealth/ukb-nightingale-biomarker-atlas. Analyses were performed using R (completed and tested with version 4.1.1).

References

1. Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Med. 12, e1001779 (2015).
2. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
3. Szustakowski, J. D. et al. Advancing human genetics research and drug discovery through exome sequencing of the UK Biobank. Nat. Genet. 53, 942–948 (2021).
4. Allen, N. E.
et al. Approaches to minimising the epidemiological impact of sources of systematic and random variation that may affect biochemistry assay data in UK Biobank. Wellcome Open Res. 5, 222 (2021).
5. Sinnott-Armstrong, N. et al. Genetics of 35 blood and urine biomarkers in the UK Biobank. Nat. Genet. 53, 185–194 (2021).
6. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
7. Würtz, P. et al. Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies. Am. J. Epidemiol. 186, 1084–1096 (2017).
8. Pietzner, M. et al. Plasma metabolites to profile pathways in noncommunicable disease multimorbidity. Nat. Med. 27, 471–479 (2021).
9. Julkunen, H., Cichońska, A., Slagboom, P. E. & Würtz, P. Nightingale Health UK Biobank Initiative. Metabolic biomarker profiling for identification of susceptibility to severe pneumonia and COVID-19 in the general population. eLife 10, e63033 (2021).
10. Smith, C. J. et al. Integrative analysis of metabolite GWAS illuminates the molecular basis of pleiotropy and genetic correlation. eLife 11, e79348 (2022).
11. Borges, M. C. et al. Role of circulating polyunsaturated fatty acids on cardiovascular diseases risk: analysis using Mendelian randomization and fatty acid genetic association data from over 114,000 UK Biobank participants. BMC Med. 20, 210 (2022).
12. Liu, J. et al. Longitudinal analysis of UK Biobank participants suggests age and APOE-dependent alterations of energy metabolism in development of dementia. medRxiv https://doi.org/10.1101/2022.02.25.22271530 (2022).
13. Bragg, F. et al. Predictive value of circulating NMR metabolic biomarkers for type 2 diabetes risk in the UK Biobank study. BMC Med. 20, 159 (2022).
14. Richardson, T. G. et al. Characterising metabolomic signatures of lipid-modifying therapies through drug target mendelian randomisation. PLOS Biol. 20, e3001547 (2022).
15. Nag, A. et al. Assessing the contribution of rare-to-common protein-coding variants to circulating metabolic biomarker levels via 412,394 UK Biobank exome sequences. medRxiv https://doi.org/10.1101/2021.12.24.21268381 (2021).
16. Bell, J. A. et al. Effects of general and central adiposity on circulating lipoprotein, lipid, and metabolite levels in UK Biobank: a multivariable Mendelian randomization study. Lancet Reg. Health - Eur. 21, 100457 (2022).
17. Fang, S., Holmes, M. V., Gaunt, T. R., Smith, G. D. & Richardson, T. G. Constructing an atlas of associations between polygenic scores from across the human phenome and circulating metabolic biomarkers. eLife 11, e73951 (2022).
18. Ritchie, S. C. et al. Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants. Sci. Data 10, 64 (2023).
19. Soininen, P., Kangas, A. J., Würtz, P., Suna, T. & Ala-Korpela, M. Quantitative serum nuclear magnetic resonance metabolomics in cardiovascular epidemiology and genetics. Circ. Cardiovasc. Genet. 8, 192–206 (2015).
20. Tikkanen, E. et al. Metabolic biomarker discovery for risk of peripheral artery disease compared with coronary artery disease: lipoprotein and metabolite profiling of 31,657 individuals from 5 prospective cohorts. J. Am. Heart Assoc. 10, e021995 (2021).
21. Wittemans, L. B. L. et al. Assessing the causal association of glycine with risk of cardio-metabolic diseases. Nat. Commun. 10, 1060 (2019).
22. Holmes, M. V. et al. Lipids, lipoproteins, and metabolites and risk of myocardial infarction and stroke. J. Am. Coll. Cardiol. 71, 620–632 (2018).
23. Würtz, P. et al. Metabolomic profiling of statin use and genetic inhibition of HMG-CoA reductase. J. Am. Coll. Cardiol. 67, 1200–1210 (2016).
24. Sliz, E. et al. Metabolomic consequences of genetic inhibition of PCSK9 compared with statin treatment. Circulation 138, 2499–2512 (2018).
25. Deelen, J. et al. A metabolic profile of all-cause mortality risk identified in an observational study of 44,168 individuals. Nat. Commun. 10, 3346 (2019).
26. Ritchie, S. C. et al. The biomarker GlycA is associated with chronic inflammation and predicts long-term risk of severe infection. Cell Syst. 1, 293–301 (2015).
27. Kettunen, J. et al. Biomarker glycoprotein acetyls is associated with the risk of a wide spectrum of incident diseases and stratifies mortality risk in angiography patients. Circ. Genom. Precis. Med. 11, e002234 (2018).
28. Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).
29. Tynkkynen, J. et al. Association of branched‐chain amino acids and other circulating metabolites with risk of incident dementia and Alzheimer’s disease: a prospective study in eight cohorts. Alzheimers Dement. 14, 723–733 (2018).
30. Mayers, J. R. et al. Elevation of circulating branched-chain amino acids is an early event in human pancreatic adenocarcinoma development. Nat. Med. 20, 1193–1198 (2014).
31. Elliott, P. & Peakman, T. C. on behalf of UK Biobank. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. Int. J. Epidemiol. 37, 234–244 (2008).

Acknowledgements

The authors are grateful to UK Biobank (Project #30418) and THL Biobank (project #BB2016_86) for access to data to undertake this study. The authors thank all biobank participants for their generous contribution to generating this resource for the scientific community. The work was funded by Nightingale Health Plc.

Author contributions

H.J., A.C., A.J.K., P.S., J.B., P.W. designed research; H.J. and A.C. contributed to statistical analyses and interpretation of results; M.T., H.K., K.N., V.M., J.N.-K., P.S., A.J.K. contributed to biomarker measurements and quality control; M.P., V.S., P.J., A.L., and K.K.
contributed data or results for replication; H.J., A.C., J.B., and P.W. contributed to the interpretation of results and wrote the manuscript. All authors reviewed the manuscript.

Competing interests
H.J., M.T., H.K., K.N., V.M., J.N.-K., A.J.K., P.S., J.B., and P.W. are employees of Nightingale Health Plc, and hold shares or stock options in Nightingale Health Plc. A.C. is a former employee of Nightingale Health Plc. V.S. has received an honorarium for consulting from Sanofi and has ongoing research collaboration with Bayer Ltd outside this work. The remaining authors declare no competing interests.

Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-023-36231-7.

Correspondence and requests for materials should be addressed to Heli Julkunen or Peter Würtz.

Peer review information Nature Communications thanks Tom Richardson, Timothy Ebbels, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Reprints and permissions information is available at http://www.nature.com/reprints

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2023
Nature Communications | (2023)14:604

Publication IV
Nightingale Health Biobank Collaborative Group. Metabolomic and genomic prediction of common diseases in 700,217 participants in three national biobanks. Nature Communications, November 2024.

© 2024 The Author(s). This is an open access article distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as appropriate credit is given to the original author(s).

Article https://doi.org/10.1038/s41467-024-54357-0

Metabolomic and genomic prediction of common diseases in 700,217 participants in three national biobanks

Nightingale Health Biobank Collaborative Group*

Received: 12 October 2023
Accepted: 8 November 2024

Identifying individuals at high risk of chronic diseases via easily measured biomarkers could enhance efforts to prevent avoidable illness and death. Using ’omic data can stratify risk for many diseases simultaneously from a single measurement that captures multiple molecular predictors of risk. Here we present nuclear magnetic resonance metabolomics in blood samples from 700,217 participants in three national biobanks. We built metabolomic scores that identify high-risk groups for diseases that cause the most morbidity in high-income countries and show consistent cross-biobank replication of the relative risk of disease for these groups.
We show that these metabolomic scores are more strongly associated with disease onset than polygenic scores for most of these diseases. In a subset of 18,709 individuals with metabolomic biomarkers measured at two time points we show that people whose scores change have different risk of disease, suggesting that repeat measurements capture changes both to health status and disease risk possibly due to treatment, lifestyle changes or other factors. Lastly, we assessed the incremental predictive value of metabolomic scores over existing clinical risk scores for multiple diseases and found modest improvements in discrimination for several diseases whose clinical utility, while promising, remains to be determined.

Identifying individuals at elevated risk of disease can help guide the use of preventative interventions. For example, in the UK the multivariable QRISK score is used to identify individuals at high risk of cardiovascular disease who should adjust their lifestyle or begin taking cholesterol-lowering or blood pressure-reducing medicine1. This concept of combining multiple measurements or risk factors into a single score has been extended to the use of ‘omic data, such as polygenic scores (PGS)2,3. By adding up the contribution of different genetic variants associated with different diseases, PGS can identify individuals at elevated risk for multiple diseases4 with one measurement (e.g., a GWAS array or genome sequence), and offer complementary information to traditional risk factors5,6. Metabolomic scores, based on adding up the contributions of multiple biomarkers measured from a blood sample, for example via nuclear magnetic resonance spectroscopy7–9, have also been shown to predict many common diseases10,11 including cardiovascular disease, type 2 diabetes12, and all-cause mortality13. Furthermore, since the metabolomic scores may change in response to lifestyle and treatment (in contrast to PGS), they can also track changes in people’s risk profiles.
A few studies have suggested complementary value for genetics and metabolomics in cardiovascular disease and type 2 diabetes14,15, but the combined use of these ‘omics-based risk predictors has not yet been evaluated at scale.

Here, we generated nuclear magnetic resonance metabolomic biomarker data in blood samples from apparently healthy individuals from three national biobanks with follow-up data on clinical outcomes. We trained risk prediction scores for the 12 leading causes of disability-adjusted life years (DALYs) in high-income countries. We investigated the relative performance of these metabolomic scores, PGS and clinical scores in different diseases and time scales. We replicated the performance of metabolomic scores across the three biobanks in the study and assessed the value of multiple metabolomic time points in two of the biobanks.

*A list of authors and their affiliations appears at the end of the paper.

Nature Communications | (2024)15:10092

Results
Metabolomic risk prediction across top sources of morbidity
Building metabolomic scores. We measured metabolomic biomarkers via nuclear magnetic resonance spectroscopy in blood samples provided at the time of enrollment from 700,217 participants in the UK Biobank, Estonian Biobank (EBB), or Finnish THL Biobank, all with linked comprehensive clinical data (Table 1, Supplementary Data 1). An overview of the study design is shown in Fig. S1. All three biobanks contain adults from Northern European countries, with varying ascertainment, recruitment years, age ranges, and procedures for extracting outcomes from electronic health records (Methods, Fig. S2). We analyzed 12 diseases causing the most morbidity in the WHO European region in 2019 (excluding falls and back pain, Fig. 1, Table 1, Supplementary Data 2), which cause more than one-third of all DALYs. We trained Cox proportional hazards models to predict incidence of each of these diseases in half of the UK Biobank. We included age and sex in all models as fixed covariates and allowed the model to select (via Lasso with tenfold cross-validation) from among 36 metabolomic biomarkers that have been validated in Europe for use in an in vitro diagnostic medical device (Methods). For all but two of the diseases studied, more than half of the biomarkers were included in the scores: 17 for Alzheimer’s disease, 18 for intracerebral hemorrhage, 21 for colon cancer, 24 for lung cancer, 26 for vascular and other dementias, 27 for stroke, 28 for alcoholic liver disease, 29 for chronic obstructive pulmonary disease (COPD), 30 for liver cirrhosis, 31 for myocardial infarction, 33 for diabetes, and 35 for depression (model coefficients for each biomarker in each score are shown in Supplementary Data 3 and Fig. S3). We evaluated the performance of these scores in the other half of the UK Biobank, as well as the Estonian and Finnish THL biobanks. As we quantify the biomarkers in absolute concentration units (e.g., mmol/l), we can directly use the variable coefficients estimated in the UK Biobank to calculate scores in the other two datasets, without normalizing the biomarkers within each study separately. This is distinct from common practice in other ‘omics analyses, where within cohort normalization is essential16,17. Figure S4 shows that we obtain highly similar results with these normalization steps, but we here present results without them to better mimic predicting a new individual’s risk without additional information (e.g., batch corrections, or cohort means and variances).

Table 1 | Basic characteristics of the participants in the three national biobanks

Biobank | UK Biobank | Estonian Biobank | THL Biobank*
Number of participants | 477,078 | 190,785 | 32,354
Age at blood sample (median, [IQR]) | 58.0 [50.0–63.0] | 43.0 [31.0–56.0] | 51.0 [39.0–61.0]
Females (N (%)) | 260,253 (54.6) | 125,565 (65.8) | 17,248 (53.3)
Body mass index (kg/m²; median, [IQR]) | 26.7 [24.1–29.8] | 25.3 [22.4–29.0] | 26.2 [23.6–29.4]
Smoking prevalence (%) | 10.5 | 18.3 | 34.7
Cholesterol lowering medication (%) | 17.4 | 10.1 | 10.7
Follow-up time (median, [IQR]) | 11.8 [11.0–12.5] | 3.2 [2.9–3.7] | 13.8 [8.8–15.2]
Recruitment period | 2006–2010 | 2002–2021 | 1997–2012
Number of incident events 4 years after baseline visit
Myocardial infarction | 2753 | 441 | 304
Ischemic stroke | 1791 | 517 | 281
Intracerebral hemorrhage | 468 | 117 | 47
Lung cancer | 1259 | 180 | 48
Type 2 diabetes | 3649 | 916 | 484
Chronic obstructive pulmonary disease | 3716 | 732 | 168
Alzheimer’s disease | 180 | 58 |
Vascular and other dementia | 326 | 175 |
Depressive disorders | 4921 | 2774 |
Alcoholic liver disease | 322 | 109 | 33
Cirrhosis of the liver | 321 | 69 | 8
Colon and rectum cancers | 1965 | 293 | 60
24
See Fig. S2 for age and recruitment year histograms. *See Supplementary Data 1 for characteristics by cohort.

Baseline age and sex minimally adjusted metabolomic scores and incident disease. We stratified the three test sets into one percent bins of the metabolomic score distribution and meta-analyzed the four-year incidence rates for each disease (Fig. 1A). The risk of incident disease increased with increasing levels of the metabolomic score across all the diseases.
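The scoring procedure described above (fixed coefficients applied to biomarkers in absolute units, then percentile bins and a high-risk boundary taken from the training data) can be sketched in a few lines. This is a minimal illustration only: the coefficient values, the variable names and the helper functions are hypothetical and are not taken from the paper or its supplementary data.

```python
from bisect import bisect_left

# Hypothetical coefficients (log hazard ratio per unit of each variable).
# These values are illustrative only and are not taken from the paper.
COEFS = {"age_years": 0.05, "sex_male": 0.30,
         "glucose_mmol_l": 0.20, "glyca_mmol_l": 0.90}


def metabolomic_score(person, coefs=COEFS):
    """Linear predictor: sum of coefficient * raw value, with no cohort rescaling."""
    return sum(beta * person[name] for name, beta in coefs.items())


def top_decile_threshold(training_scores):
    """Smallest score that falls in the top 10% of the training scores."""
    ordered = sorted(training_scores)
    return ordered[(9 * len(ordered)) // 10]


def percentile_bin(score, sorted_training_scores):
    """One-percent bin (0 to 99) of a score within the training distribution."""
    rank = bisect_left(sorted_training_scores, score)
    return min(99, rank * 100 // len(sorted_training_scores))


def is_high_risk(score, threshold):
    """Apply the fixed training-set boundary to an individual from any cohort."""
    return score >= threshold
```

Because the threshold is computed once on the training scores and then reused unchanged, the share of people flagged as high risk in another cohort need not be exactly 10%.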
As has been observed previously4,10, these curves follow a quantile-logistic function, which rises superexponentially in the tails, making it possible to identify subsets of individuals that are at much higher risk than average. This effect is especially dramatic for the scores that most strongly predict disease, including type 2 diabetes and liver diseases. Figure 1B shows the performance of the scores by comparing the relative risk of incident disease in the 10% of individuals with the highest metabolomic scores (high-risk group, red shaded area, Fig. 1A) to the remaining population. Again, to avoid needing within-cohort scaling factors or thresholds, we used the top 10% boundary from our training data to define this group in the other half of the UK Biobank and the other two biobanks. This means the proportion of individuals in the high-risk group varies across the three biobanks (Supplementary Data 4), but this high-risk group nonetheless had consistently increased risk across diseases (Fig. 1B). Only depression, alcoholic liver disease, lung cancer, and COPD showed significant meta-analysis heterogeneity (Cochran’s Q, p < 0.004 to account for multiple testing). The UK Biobank test set had the highest point estimate of effect size in only 4 of 12 diseases, demonstrating that the scores are capturing generalizable risk factors, rather than overfitting to the UK Biobank. The meta-analysis of the three test sets included hazard ratios of ~10 for two types of liver disease and diabetes, ~4 for COPD and lung cancer, and ~2.5 for myocardial infarction, stroke and vascular dementia, and was statistically significant (fixed-effect meta-analysis Z score test, p < 0.004 adjusted for multiple testing) for all diseases (Fig. 1B). The pattern of association is similar when considering hazard ratios per standard deviation in a continuous model (Fig. 
S4), and population-wide discrimination, as measured by area under the receiver-operating characteristic curve (AUC), shows consistent, though variable, improvement when adding metabolomic scores to age and sex (Supplementary Data 5).

[Figure 1: panel A plots incidence (%) against percentile of metabolomic score for each of the 12 diseases; panel B plots the hazard ratio (95% CI) of the highest risk decile vs. the remaining population for each disease, shown for the meta-analysis, UK Biobank, THL Biobank and Estonian Biobank.]
Fig. 1 | Association between metabolomic scores and disease onset in three national biobanks. A Observed incidence of the 12 diseases divided into one percent bins of the metabolomic score. The observed incidence is shown as a sample size weighted mean of the 4-year incidence in the three biobank cohorts (n = 481,678). Red shading shows the top 10% of the metabolomic score (adjusted for age and sex). Horizontal dashed line shows the population prevalence. B Four-year hazard ratios of metabolomic scores (adjusted for age and sex) comparing the highest 10% to the remaining 90% of the study population for 12 diseases (n = 481,678). The dots represent point estimates (Cox regression estimates for individual cohorts, fixed inverse-variance weighted mean for meta-analysis) and the horizontal error bars denote 95% confidence intervals of the hazard ratio. Source data are provided as a Source Data file.

Sensitivity and subgroup analyses of minimally adjusted scores. Having demonstrated that it is possible to construct metabolomic scores that are replicably associated with risk of these diseases, we next sought to use the diverse data available in these biobanks to investigate further properties of metabolomic scores. First, we assessed the performance of scores using all 249 metabolomic biomarkers we measured, rather than the 36 clinically validated biomarkers described above. Only diabetes and COPD showed consistently improved performance using the extended metabolomics (Fig. S5), likely because many of the biomarkers are correlated, and our clinically validated subset captures a large fraction of the total available information in most cases. In some cases, such as Alzheimer’s and the liver diseases in the EBB, the extended metabolomic scores do not replicate as well as the simpler scores. Taken together, these results suggest some diseases may benefit from additional molecular measurements, but care must be taken that they do not capture cohort-specific effects which are less transferrable. Second, we sought to understand the extent to which our scores are driven by well-established behavioral risk factors for some of these diseases, in particular tobacco smoking and alcohol consumption. Both lung disease scores show attenuated, but still strong, association when conditioning on pack-years of smoking, suggesting that while they are partly driven by this behavior, they capture additional information beyond the self-reported variables (Fig. S6). The performance
The performance Nature Communications | (2024)15:10092 3 Article https://doi.org/10.1038/s41467-024-54357-0 of our lung cancer score is reduced to almost zero in never smokers, whereas our COPD score still has significant prediction in that group. Liver cirrhosis is equally well predicted across ever and never drinkers, and virtually unaffected by an adjustment for daily alcohol units. The adjusted alcoholic liver disease prediction is somewhat reduced but remains very strong (Fig. S6). Third, as we were limited to somewhat coarse ICD-10 based definitions of the diseases we were studying, we examined whether broader or narrower definitions might change our results by investigating cardiovascular disease more closely. The score trained on the narrow outcome of myocardial infarction had a 0.96 correlation with one trained more broadly on ischemic heart disease. This suggests that for deriving the scores the definition of the disease endpoint is not very sensitive, likely because the underlying risk factors are broadly shared. When testing the scores for association in the test datasets, both show a gradient of increasing effect size for more severe outcomes, from unstable angina to first myocardial infarction to subsequent myocardial infarction (Fig. S7), suggesting that the scores may be strongest at predicting severe outcomes. A Metabolomics + PGS PGS Metabolomics Myocardial infarction Ischemic stroke Intracerebral hemorrhage Lung cancer Type 2 diabetes Chronic obstructive pulmonary disease Alzheimer's disease Vascular and other dementia Depressive disorders Alcoholic liver disease Cirrhosis of the liver Colon and rectum cancers 1 B 3 5 10 Hazard ratio (95% CI), highest risk decile vs. 
remaining population Top decile of PGS and metabolomic score Myocardial infarction Top decile of PGS, bottom 90% of metabolomic score Ischemic stroke Intracerebral hemorrhage 8% 0.6% 4% 6% 3% Cumulative incidence (%) 4% 0% 2 4 6 8 15% 10% 4 6 8 0% 10 2 Alzheimer's disease Chronic obstructive pulmonary disease 5% 1% 0.0% 2 10 12.5% Type 2 diabetes 20% 3% 2% 0.2% 1% 0% Lung cancer 4% 0.4% 2% 2% 30 Bottom 90% of PGS 4 6 8 0% 10 2 Depressive disorders 4 6 8 10 2 Cirrhosis of the liver 4 6 8 10 Colon and rectum cancers 8% 1.00% 3% 10.0% 6% 0.75% 7.5% 2% 2% 4% 0.50% 5.0% 1% 0.0% 0.00% 2 4 6 8 1% 2% 0.25% 2.5% 0% 10 2 4 6 8 10 0% 2 4 6 8 0% 10 2 4 6 8 10 2 4 6 8 10 Follow−up time (years) C Metabolomics Myocardial infarction Ischemic stroke Intracerebral hemorrhage 4 4 4 2 2 2 1 1 2 Hazard ratio (95% CI), highest risk decile vs. remaining population PGS 4 6 8 10 1 2 Lung cancer 4 6 8 10 2 10 20 10 5 10 5 1 1 2 4 6 8 10 4 6 8 10 2 Vascular and other dementia 10 5 5 1 1 6 8 10 8 10 8 10 1 2 Alzheimer's disease 10 4 Chronic obstructive pulmonary disease Type 2 diabetes 4 6 Depressive disorders 4 2 1 2 4 6 8 10 2 Alcoholic liver disease 4 6 8 10 2 Cirrhosis of the liver 50 20 25 10 1 1 4 6 Colon and rectum cancers 4 2 1 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 Time to event (years) Fig. 2 | Associations between metabolomic and polygenic scores and disease onset. A Ten-year hazard ratios of metabolomic, polygenic or combined scores comparing the highest risk decile to the remaining study population (UK Biobank test set, n = 242,492). Dots represent Cox regression estimates and horizontal error bars denote 95% confidence intervals of the hazard ratio. B Risk of disease incidence after blood sampling for high genetics risk group stratified by their metabolomic score and for average genetic risk group. Shaded region denotes 95% confidence interval. C Hazard ratios for highest decile of metabolomic or polygenic scores stratified by time to event. 
Dots and vertical error bars denote Cox regression estimates and 95% confidence intervals per bin, shaded region is 95% confidence interval for a generalized survival model allowing a time-varying effect using natural splines with 2 knots. All scores adjusted for age and sex (n = 242,492). Source data are provided as a Source Data file. Nature Communications | (2024)15:10092 Comparison of hazard ratios for incident disease among metabolomic, genetic, and combined scores We next compared the performance of metabolomic scores to PGS, which have received widespread attention for risk stratification to aid prevention4,18,19. We calculated PGS using variant weights from the PGS Catalog (Supplementary Data 2), only using scores that were built from genome-wide association study (GWAS) data that did not include the biobanks studied here. We again trained models in half the UK Biobank always including age and sex, and using Lasso to select from: (i) only the external PGS, (ii) among the metabolomic biomarkers (as above), or (iii) among both the PGS and the metabolomic biomarkers. PGS were available for 10 of our diseases, and, as expected, the top 10% high-risk groups were at significantly higher risk than the remaining 90% (Fig. 2A). However, the hazard ratio of being in the genetic high-risk group was less than the metabolomic high-risk group in all diseases except colorectal cancer. In most cases, the best performing model included both genetic and metabolomic scores, suggesting that these two data types capture at least partially complementary information. A formal interaction test between metabolomic and genetic scores found a significant effect only for type 2 diabetes (Supplementary Data 6), and the small confidence intervals demonstrated that genetic and metabolomic risk is primarily additive on the log hazard ratio scale. For six diseases we could also calculate PGS in the EBB, which replicated the results in the UK Biobank (Fig. S8). 
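As a brief illustration of what "additive on the log hazard ratio scale" implies: in a Cox-type model without an interaction term the log hazard ratios of the two scores add, so the hazard ratios themselves multiply. The sketch below uses a hypothetical helper function and made-up hazard ratios; it is not the interaction test used in the paper.

```python
import math


def combined_hazard_ratio(hr_genetic, hr_metabolomic, log_interaction=0.0):
    """Hazard ratio for someone in both high-risk groups vs. neither.

    log_interaction is the coefficient of a product (interaction) term;
    0.0 corresponds to purely additive effects on the log hazard scale.
    """
    log_hr = math.log(hr_genetic) + math.log(hr_metabolomic) + log_interaction
    return math.exp(log_hr)


# Made-up inputs: a genetic HR of 2 and a metabolomic HR of 3 combine to
# roughly 6 when there is no interaction.
no_interaction = combined_hazard_ratio(2.0, 3.0)
with_interaction = combined_hazard_ratio(2.0, 3.0, log_interaction=math.log(1.5))
```

A non-zero interaction coefficient would move the combined hazard ratio away from the simple product, which is what a formal interaction test looks for.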
We stratified individuals in the genetic high-risk group by whether they were also in the metabolomic high-risk group (Fig. 2B). Individuals at high risk by both PRS and metabolomics scores are indeed at very elevated risk, but genetically predisposed individuals not in the high metabolomic risk group have risk similar to (or in some cases less than) those not in the genetically predisposed group. This is likely because current PGS capture less than half of the genetic risk for these diseases, and that unexplained heritability, combined with lifestyle and environmental history, is partially reflected in the metabolomic score.

The metabolomic and genetic scores also have different patterns of correlation between different diseases (Fig. S9). As has been previously shown, the PGS for different diseases tend to be largely uncorrelated19, whereas the different metabolomic scores are nearly all correlated with each other, reflecting multi-morbidity10,13. Combining the two types of information can yield both improved performance and specificity of risk stratification.

While the three biobanks are dominated by individuals of European ancestry, we did compare the transferability of the metabolomic scores and PGS for 8 endpoints with at least 35 events in multiple ancestries in the UK Biobank (Fig. S10). The metabolomic scores remained significantly predictive across disease-ancestry combinations, though often with weaker effect size estimates than in the European ancestry group. As has been previously shown, the effect sizes of PGS were also attenuated in non-European ancestries, and because they perform worse than metabolomics in Europeans, the estimate was statistically significant in only 3 out of 19 non-European comparisons. For the metabolomic scores, 12 out of 19 were significant. This shows how more diverse datasets will be essential not just to produce transferrable polygenic risk scores, but more generally to produce multi-omic scores that are as widely useful as possible.

The longer follow-up time in the UK Biobank also allowed us to compare short-term and long-term prediction from these scores. As expected, since PGS are fixed throughout life, their hazard ratios remained constant over follow-up time (Fig. 2C). The relationship between hazard ratio for the metabolomic score and time to event varied by disease: diabetes, lung cancer, vascular dementia and alcoholic liver disease scores provide stronger stratification of near-term risk, but for most diseases the metabolomic scores were stable over time, like PGS.

[Figure 3: cumulative incidence (%) against follow-up time after repeat sample (years) for eight diseases, for those who stayed in the top decile of the metabolomic score, left the top decile, joined the top decile, and everyone else.]
Fig. 3 | Risk of disease onset for eight diseases stratified by metabolomics scores at two-time points. Maroon lines show those who were in the high-risk group at both time points ~5 years apart, green lines show those who were in the high-risk group at enrollment but had left it by the second time point, orange lines show those who were in the low risk group at baseline and moved to the high risk group at follow-up, and black lines show those who were in the low-risk group at both time points. Data from 18,709 UK Biobank participants with metabolomic scores from two time points. Shaded areas are 95% confidence intervals, derived from the standard error of the cumulative hazard. Source data are provided as a Source Data file.
Incident disease risk among participants with two metabolic scores
We generated metabolomic profiles at a second time point from blood samples donated by 18,709 UK Biobank participants who returned for a repeat visit approximately four and a half years after they initially enrolled in the study (median time difference 4.4 years, mean 4.3 years, range 2.1–6.9 years). The correlations of the scores range from 0.42 for Alzheimer’s disease to 0.71 for diabetes and fall in the middle of the range of correlations for individual biomarkers (e.g., amino acids ~0.3, HDL cholesterol ~0.8) (Supplementary Data 7, Fig. S11). For eight diseases (myocardial infarction, ischemic stroke, diabetes, COPD, depression, colorectal cancer, lung cancer, and vascular and other dementias) at least 100 events occurred within 10 years of the repeat visit, so we fitted a joint risk model with baseline and follow-up metabolomic score measurements. For diabetes (Cox regression, baseline HR_b = 2.52, 95% CI = 2.24–2.83, p_b = 2.45 × 10^-54, follow-up HR_f = 1.57, 95% CI = 1.39–1.77, p_f = 1.10 × 10^-13) and COPD (HR_b = 1.52, 95% CI = 1.33–1.72, p_b = 2.2 × 10^-10, HR_f = 1.31, 95% CI = 1.31–1.70, p_f = 1.12 × 10^-9) both time points were significantly associated with 10-year risk; for all other diseases except vascular and other dementias the hazard ratio point estimates were all consistently positive but were not significant due to weaker prediction from the scores and smaller sample size. This suggests that both a person’s current metabolomic score value, as well as previously measured score values, contribute information about risk of disease onset.

To further explore this idea, we considered individuals in the top 10% high-risk groups at the first time point and compared the subset of that group who remained in the high-risk group at the follow-up time point to those who had left it. For diabetes, leaving the high-risk group showed a significant reduction in risk (Cox regression, HR = 2.58, 95% CI = 1.74–3.84, p = 2.7 × 10^-6), after adjusting for baseline score (Fig. 3). For lung cancer and COPD, risks were reduced fivefold (HR = 4.96, 95% CI = 1.61–15.23, p = 5.2 × 10^-3) and 1.9-fold (HR = 1.92, 95% CI = 1.21–3.06, p = 5.8 × 10^-3) respectively, but estimates were no longer significant after multiple testing correction. We replicated this analysis in 5038 individuals from the EBB for whom we also profiled a second time point from blood samples donated approximately five years after the baseline survey. We observed the same effect for type 2 diabetes, which was the only disease for which we had sufficient cases to test (HR = 4.4, p = 0.002).

While we do not know what caused individual metabolomic scores to change between time points in these observational cohorts, we can assess what differences in lifestyle factors are associated with changes in metabolomic scores. For example, obese individuals who stayed in the high-risk group for diabetes gained an average 0.18 units of body mass index (BMI), but those who changed from high to low risk lost an average of 0.81 units of BMI (difference of 0.99, 95% CI 0.78–1.20, linear regression t(df = 1348) = 9.22, p = 1.08 × 10^-19). Among self-reported smokers who were in the high-risk group for COPD at the first time point, 64% of those who continued smoking remained at high risk, compared to just 40% of those who reported quitting between the two time points (Fisher’s exact test OR = 2.72, 95% CI = 2.47–6.68, p = 0.0055). However, these explained only a few percent of the observed metabolomic score changes, demonstrating that the scores integrate a wider range of information than questionnaires.
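The two-time-point stratification discussed above (stayed in the high-risk group, left it, joined it, or was never in it) amounts to comparing each score with a threshold at both visits. The following is a minimal sketch; the function names, group labels and the use of a single shared threshold at both time points are assumptions made for illustration, not details confirmed by the source.

```python
from collections import Counter


def trajectory_group(baseline_score, followup_score, threshold):
    """Label a participant by high-risk membership at two time points.

    Assumes the same top-decile threshold is applied at both visits.
    """
    was_high = baseline_score >= threshold
    is_high = followup_score >= threshold
    if was_high and is_high:
        return "stayed in top decile"
    if was_high:
        return "left top decile"
    if is_high:
        return "joined top decile"
    return "everyone else"


def count_groups(score_pairs, threshold):
    """Tally the four groups over (baseline, follow-up) score pairs."""
    return Counter(trajectory_group(b, f, threshold) for b, f in score_pairs)


# Made-up score pairs, one per group, with a made-up threshold of 9.0.
example_counts = count_groups(
    [(9.5, 9.7), (9.2, 8.0), (7.0, 9.1), (6.0, 6.5)], threshold=9.0
)
```

The reported BMI contrast follows directly from the two group means: a gain of 0.18 units versus a loss of 0.81 units gives a difference of 0.18 + 0.81 = 0.99 units.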
Comparing multi-omics to existing clinical risk scores
We next compared our multi-omics predictions to published clinical risk scores both in terms of hazard ratios for the top decile and population-wide AUC. For all our diseases except Alzheimer’s we identified scores that are recommended for use either by the National Health Service in England and Wales (NHS) or by professional bodies in the UK, EU, or USA, and calculated them as accurately as possible using available variables in the UK Biobank (Methods). These scores vary in the types of variables they include (e.g., QRISK for cardiovascular disease includes several blood and anthropometric measurements, whereas PHQ2 for depression is based solely on two self-reported questions). The multi-omics scores perform significantly better for 10-year risk prediction of myocardial infarction, the two liver diseases, and colon cancer, while the clinical scores perform significantly better for lung cancer, diabetes, and depression, and the remaining four either have no difference, or inconsistent results between AUC and hazard ratio of the top decile (Fig. 4A, Table 2). For all diseases except intracerebral hemorrhage (which is not well predicted beyond age and sex by any score we tested) a combined clinical+multi-omic score has significantly higher AUC than the clinical score, with increases ranging from 0.006 for lung cancer to 0.118 for alcoholic liver disease (Table 2). These results were also consistent for four years of follow-up (Supplementary Data 8, Fig. S12).

To illustrate how the multi-omics score could augment the most widely used risk screening tool in the UK, we further focused on comparing QRISK to QRISK+multi-omics in individuals not using statins at baseline, which approximates the eligible group for a common use of these scores in prioritizing patients for statin treatment. The AUCs improve by 0.029 for myocardial infarction and 0.008 for stroke, which are equivalent to the whole-population values in Table 2 (i.e., there is no interaction between these scores and statin usage). Considering net reclassification index, the continuous changes (i.e., net fraction of individuals whose risk score moves in the right direction) are substantial: 0.21 for myocardial infarction (MI) events (NRI+) and 0.31 for MI non-events (NRI–), and 0.02 and 0.18, respectively, for stroke. Considered categorically (i.e., moving in the right direction between high and low-risk groups) they are NRI+ = 0.09 and NRI– = −0.01 for MI and NRI+ = 0.02 and NRI– = 0.00 for stroke.

Finally, we compared our high-risk groups to the remainder of the population using the Frailty Index20,21 as a surrogate for an overall impression of the health of an individual. The high-risk group has slightly higher frailty index values (Fig. S13) and generally different clinical characteristics (Supplementary Data 9).

Calibration across studies from different countries
Risk prediction models should be evaluated based on both discrimination and calibration22. We therefore tested the calibration of our metabolomic scores by plotting observed event rates against predicted absolute event rates per decile in all three biobanks (Fig. 4B). We estimated calibration slopes and intercepts by fitting a logistic regression model of observed risk on predicted risk, without any study-specific processing or normalization, mimicking real-world patient usage (Methods). For the main calibration analysis, we included diseases with >200 events over 3 years, as recommended by earlier studies23,24. Calibration results for the remaining diseases are shown in Fig. S14. Overall, the metabolomic scores demonstrated good calibration.
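A calibration slope and intercept of the kind described above can be estimated with a two-parameter logistic regression. The sketch below regresses the observed outcome on the logit of the predicted risk, which is a common convention but an assumption here, since the exact model specification is given only in the paper's Methods. The function names are hypothetical, and the fit uses a plain Newton-Raphson iteration with step halving rather than any particular statistics library.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_likelihood(a, b, xs, ys):
    """Bernoulli log-likelihood of outcomes ys under logit(mu) = a + b * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = a + b * x
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def calibration_fit(predicted_risks, outcomes, max_iter=50, tol=1e-10):
    """Return (intercept, slope) of outcome ~ intercept + slope * logit(predicted)."""
    xs = [math.log(p / (1.0 - p)) for p in predicted_risks]
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(max_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, outcomes):
            mu = sigmoid(a + b * x)
            w = mu * (1.0 - mu)
            g_a += y - mu            # gradient of the log-likelihood
            g_b += (y - mu) * x
            h_aa += w                # negative Hessian entries
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        # Halve the Newton step until the log-likelihood does not decrease.
        step, current = 1.0, log_likelihood(a, b, xs, outcomes)
        while step > 1e-8 and log_likelihood(a + step * da, b + step * db, xs, outcomes) < current:
            step /= 2.0
        a += step * da
        b += step * db
        if abs(step * da) + abs(step * db) < tol:
            break
    return a, b
```

A slope near 1 with an intercept near 0 indicates that predicted and observed risks agree; a slope below 1 suggests predictions that are too extreme, and a slope above 1 predictions that are too compressed.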
In the UK Biobank test set the calibration slopes ranged from 0.95 to 1.24 across diseases, as expected since the models were trained in the other half of this biobank. In the EBB, the calibration slopes ranged from 0.76 to 1.16, except for depression at 0.42. This difference is likely a result of diagnostic differences in depression in different countries, as well as how those diagnoses are encoded in electronic records. In the Finnish THL Biobank the slopes were 1.03 (ischemic stroke), 1.20 (myocardial infarction) and 1.21 (diabetes), though the absolute rates varied considerably, likely reflecting different rates of these diseases in the earlier recruitment waves of these cohorts.

Discussion

We have shown that metabolomic scores can identify individuals at increased risk across a range of diseases, consistent with a previous report that analyzed a subset of the data described here10. We have replicated these findings in two additional national biobanks and demonstrated consistent performance in three countries which have varying sample collection types (e.g., plasma vs. serum), enrollment criteria, fasting protocols, and electronic health record systems. As the biomarkers that constitute these scores are measured in absolute units, the scores can be computed without cohort-specific rescaling, which could aid in clinical translation of the scores. More than 1 in 4 individuals in these biobanks are in the high-risk group for at least one of the cardiovascular diseases, lung diseases, liver diseases or diabetes, where the high-risk group has at least 2.5-fold increase in risk (likely an underestimate of population levels, due to healthy volunteer bias).

Our direct comparison, and combination, of metabolomic and polygenic risk factors suggests the value of multi-omic scores.
When good predictors exist from both ‘omics data types (e.g., myocardial infarction, diabetes, colorectal cancer) the scores are complementary and together provide an improved combination of predictive accuracy and specificity. Combining these forms of information may also be useful in maximizing the predictive accuracy while avoiding the perceived determinism25 of fixed genetic predictions: our data suggest both that genetic factors contribute stable long-term predictive information and that these can be outweighed, particularly in the shorter term, by detectable differences in metabolomic risk profile (driven, in part, by factors modifiable by lifestyle and treatment changes). Our analysis of follow-up samples underscores this: the scores show an intermediate level of stability five years apart, meaning they provide long-term risk prediction and can also track measurable change in risk in response to lifestyle changes or treatment. The observational nature of this dataset, as well as potential survivor bias in the individuals who participated in the repeat sampling, limits our ability to make causal inferences about associations between changes in lifestyle factors and changes in scores. Future studies that explicitly measure metabolomics before and after an intervention are needed to fully explore how metabolic risk scores track changes in modifiable risk.

For 11 of the diseases we studied, we compared to existing scores used for risk prediction or screening. We identified clear examples where multi-omic scores alone outperformed these clinical scores, including for myocardial infarction and liver diseases. In other diseases the clinical scores perform better, often for understandable reasons. For instance, for type 2 diabetes the QDiabetes score we used includes HbA1c, the gold-standard diagnostic biomarker for the disease.
For lung cancer, the clinical score directly includes the vital causal risk factor of smoking, and yet the multi-omic score still provides significant additional information: if we restrict to current smokers, the multi-omic score outperforms the clinical score for lung cancer (hazard ratio of 2.2 vs 1.8). Multi-omic scores provide statistically significant improvements when added on top of existing scores in all but one disease.

[Figure 4. Panel A: hazard ratio (95% CI), highest risk decile vs. remaining population, for the Clinical, Multi-omics, and Multi-omics + clinical models of myocardial infarction (QRISK3), ischemic stroke (QRISK3), intracerebral hemorrhage (QRISK3), lung cancer (LLP v2 without asbestos), type 2 diabetes (QDiabetes), chronic obstructive pulmonary disease (COPD-PS), vascular and other dementia (QRISK3), depressive disorders (PHQ-2), alcoholic liver disease (AUDIT-C), cirrhosis of the liver (FLI), and colon and rectum cancers (QCancer). Panel B: observed event rate (%) against predicted event rate (%) in UK Biobank, Estonian Biobank, and THL Biobank (pooled) for chronic obstructive pulmonary disease, myocardial infarction, colon and rectum cancers, type 2 diabetes, depressive disorders, and ischemic stroke.]

Fig. 4 | Predictive ability and calibration of models including clinical, polygenic and/or metabolomic scores. A Ten-year hazard ratios of models with clinical variables compared to those variables plus the best ‘omic data (either just metabolomics or metabolomics plus PGS from Fig. 2, n = 241,750). Dots represent Cox regression estimates and horizontal error bars denote 95% confidence intervals of the hazard ratio. B For each disease, the calibration of three-year observed event rates is shown by 10 equally sized deciles of absolute risk predicted by a metabolomic score adjusted for age and sex (n = 415,592). Dots represent means and vertical lines represent 95% confidence intervals of the observed event rate. Calibration slopes and intercepts were derived from a logistic regression of the observed risk on the predicted risk. Source data are provided as a Source Data file.

Table 2 | Comparison of area under receiver operating curves in clinical and ‘omics scores in UK Biobank over ten years of follow-up

| Disease (clinical score) | Clinical | PGS | Metabolomics | PGS + Metabolomics | Clinical + PGS | Clinical + Metabolomics | Clinical + ‘omics |
| Myocardial infarction (QRISK3) | 0.75 (0.74–0.76) | 0.73 (0.72–0.74); −0.02 (p = 1.8e–11) | 0.74 (0.74–0.75); −0.004 (p = 6.9e–02) | 0.76 (0.76–0.77); 0.01 (p = 3.6e–08) | 0.77 (0.76–0.77); 0.02 (p = 2.0e–27) | 0.76 (0.76–0.77); 0.01 (p = 5.9e–26) | 0.78 (0.77–0.78); 0.03 (p = 2.6e–51) |
| Ischemic stroke (QRISK3) | 0.74 (0.73–0.75) | 0.71 (0.70–0.72); −0.03 (p = 7.9e–29) | 0.73 (0.72–0.73); −0.01 (p = 3.7e–08) | 0.73 (0.72–0.74); −0.01 (p = 7.6e–05) | 0.74 (0.73–0.75); 0.003 (p = 5.9e–05) | 0.75 (0.74–0.75); 0.008 (p = 1.8e–07) | 0.75 (0.74–0.76); 0.01 (p = 9.6e–10) |
| Intracerebral hemorrhage (QRISK3) | 0.72 (0.70–0.73) | 0.70 (0.68–0.72); −0.01 (p = 5.0e–05) | 0.71 (0.69–0.72); −0.008 (p = 0.041) | 0.71 (0.69–0.72); −0.008 (p = 0.034) | 0.71 (0.70–0.73); 0.000 (p = 0.36) | 0.72 (0.70–0.74); 0.004 (p = 0.11) | 0.72 (0.70–0.74); 0.004 (p = 0.16) |
| Lung cancer (LLP v2 without asbestos) | 0.82 (0.81–0.84) | 0.70 (0.69–0.72); −0.12 (p = 1.4e–85) | 0.75 (0.74–0.76); −0.08 (p = 2.8e–42) | 0.75 (0.74–0.77); −0.07 (p = 2.1e–36) | 0.83 (0.82–0.84); 0.004 (p = 2.7e–04) | 0.83 (0.82–0.84); 0.004 (p = 0.017) | 0.83 (0.82–0.84); 0.006 (p = 3.5e–04) |
| Type 2 diabetes (QDiabetes) | 0.87 (0.86–0.87) | 0.66 (0.65–0.66); −0.21 (p < 5e–324) | 0.81 (0.80–0.81); −0.06 (p = 4.3e–102) | 0.82 (0.81–0.82); −0.05 (p = 9.6e–78) | 0.87 (0.87–0.88); 0.003 (p = 2.7e–17) | 0.89 (0.88–0.89); 0.02 (p = 2.4e–69) | 0.89 (0.88–0.89); 0.02 (p = 2.5e–84) |
| COPD (COPD-PS) | 0.80 (0.80–0.81) | 0.70 (0.69–0.72); −0.10 (p = 1.4e–81) | 0.76 (0.75–0.77); −0.05 (p = 1.7e–19) | 0.77 (0.76–0.78); −0.04 (p = 6.7e–12) | 0.81 (0.80–0.82); 0.006 (p = 9.9e–07) | 0.83 (0.82–0.84); 0.02 (p = 2.0e–21) | 0.83 (0.82–0.84); 0.03 (p = 2.7e–27) |
| Alzheimer’s disease (NA) | – | 0.82 (0.81–0.83) | 0.82 (0.81–0.83) | 0.83 (0.82–0.84) | – | – | – |
| Vascular and other dementia (QRISK3) | 0.82 (0.81–0.84) | – | 0.82 (0.81–0.83); 0.000 (p = 0.84) | – | – | 0.82 (0.81–0.83); 0.006 (p = 8.5e–04) | – |
| Depressive disorders (PHQ-2) | 0.67 (0.66–0.68) | 0.57 (0.56–0.57); −0.10 (p = 6.5e–115) | 0.61 (0.60–0.62); −0.06 (p = 1.9e–39) | 0.61 (0.61–0.62); −0.05 (p = 7.5e–33) | 0.67 (0.67–0.68); 0.006 (p = 3.1e–05) | 0.68 (0.68–0.69); 0.02 (p = 5.2e–12) | 0.69 (0.68–0.69); 0.02 (p = 9.2e–15) |
| Alcoholic liver disease (AUDIT-C) | 0.75 (0.72–0.77) | – | 0.87 (0.85–0.88); 0.12 (p = 5.1e–19) | – | – | 0.86 (0.84–0.88); 0.12 (p = 2.4e–24) | – |
| Cirrhosis of the liver (FLI) | 0.75 (0.72–0.77) | 0.62 (0.60–0.64); −0.12 (p = 5.4e–19) | 0.80 (0.78–0.82); 0.05 (p = 3.1e–05) | 0.80 (0.78–0.82); 0.05 (p = 2.1e–05) | 0.75 (0.73–0.77); 0.004 (p = 0.11) | 0.84 (0.82–0.86); 0.10 (p = 2.5e–23) | 0.84 (0.82–0.86); 0.10 (p = 1.9e–23) |
| Colon and rectum cancers (QCancer) | 0.68 (0.67–0.69) | 0.69 (0.68–0.70); 0.01 (p = 1.3e–06) | 0.68 (0.67–0.69); −0.003 (p = 2.6e–01) | 0.70 (0.69–0.71); 0.02 (p = 1.3e–09) | 0.70 (0.69–0.71); 0.02 (p = 6.3e–20) | 0.68 (0.67–0.69); 0.004 (p = 8.7e–03) | 0.70 (0.69–0.71); 0.03 (p = 2.0e–19) |

In each cell, the first entry is the area under receiver operating curve (95% Confidence Interval) for the score and the second entry is the AUC difference in comparison to the clinical score (p-value). Two-tailed p-values were calculated using DeLong’s method, not adjusting for multiple testing.
Future work will be needed to quantify the extent to which these statistically significant improvements in prediction, when used to guide health interventions, could translate into improvements in population health, as has been recently studied in the specific case of cardiovascular disease26. Furthermore, it will be important to consider whether and how multi-omic data fit in the diverse clinical contexts represented by these scores. QRISK3 and QDiabetes are mainly tools for identifying high-risk individuals who would benefit from primary prevention efforts; COPD-PS, PHQ-2, AUDIT-C and Fatty Liver Index (FLI) are screening tools for prioritizing further investigation in individuals who may have undiagnosed disease; and QCancer and Liverpool Lung Project (LLP) scores are used both for primary prevention and to aid early detection.

In addition to accuracy, our results also allow us to test calibration across populations, which is vital to future applied use. While imperfect, the calibration of our scores is comparable to that of widely used tools like the pooled cohort equations for cardiovascular risk when compared, for example, between the US and Canada27.

Our study has several limitations which we could only partially mitigate. First, although biobanks are powerful for studying many diseases simultaneously, they typically have less detailed and well-curated phenotypes than disease-specific clinical cohorts. For example, we use endpoint definitions based on broad categories of ICD-10 codes. In the case of cardiovascular disease, we have shown that using more fine-grained coding does not substantially alter our conclusions, but clinically focused collections will help bring additional resolution. Furthermore, biobanks are known to have healthy volunteer bias, which can affect comparisons between these results and true rates of disease in the overall population.
Second, we defined high-risk groups as the top 10% of each disease’s risk score, but in practice different cut points may be appropriate for different diseases (e.g., in liver disease, risk is strongly concentrated in the top 1%), or cut-offs may not be necessary at all if the goal is to deliver improved prediction across the full range of baseline risk in the population. Third, these three biobanks are nearly all of European ancestry, and while we showed some promising results in the non-European ancestry subset of UK Biobank, analyses of more diverse cohorts will be essential. Finally, in comparing against clinical screening there are certain variables not measured in these biobanks that preclude a complete comparison, including some widely used low-cost tests (e.g., fecal occult blood samples in colon cancer) and more intensive screening tools (e.g., CT screening for lung cancer or colonoscopies for colon cancer).

Many healthcare systems desire a more personalized and preventative model of disease management, including longitudinal monitoring of risk, early detection of disease, and active, patient-centered management of risk factors. For this reason, it will be valuable to understand how classical risk factors can be supplemented or replaced with newer predictors including metabolomics and genetics, and how and why these risk predictions change over time. We believe our results, and the extremely large dataset that underlies them, are an important part of this puzzle. Future work will need to focus on how we may incorporate additional new predictors to stratify disease risk further (e.g., recent work on proteomics has shown promise28), as well as how this next generation of prediction algorithms, described here for research use, can be validated, made actionable and delivered to patients.
Methods

Study populations and endpoint definitions

We used data from a total of 700,217 individuals from three biobanks, UK Biobank (N = 477,078), EBB (N = 190,785) and Finnish THL Biobank (N = 32,354). Figure S1 shows an overview of our study design and Table 1 shows summaries of participant characteristics.

The UK Biobank is a longitudinal biomedical study of approximately half a million participants between 38 and 71 years old from the United Kingdom29. Participant recruitment was conducted on a volunteer basis and took place between 2006 and 2010. Initial data were collected in 22 different assessment centers throughout Scotland, England, and Wales. Data collection includes elaborate genotype, environmental and lifestyle data. Blood samples were drawn at baseline for all participants, with an average of four hours since the last meal, i.e., generally non-fasting. Nuclear magnetic resonance (NMR) metabolomic biomarkers (Nightingale Health, quantification library 2020) were measured from EDTA plasma samples (100 μL, aliquot 3) during 2019–2023 for the entire cohort. In addition, plasma samples were measured by NMR metabolomics from ~20,000 participants who underwent a repeat-visit assessment on average five years after the baseline visit. The NMR protocol is known to perform well on blind duplicate samples in the UK Biobank8. Follow-up data include a wide range of electronic health-related records, including disease incidence, hospital admissions, primary care, and death records, which are presently still regularly updated. The UK Biobank study was approved by the North West Multi-Centre Research Ethics Committee. This research was conducted using the UK Biobank Resource under Application Number 30418.

The EBB is a curated population-based biobank of Estonia, comprising a cohort of ~210,000 volunteers30.
The participants constitute about 20% of the adult Estonian population and the cohort is approximately representative of the nation in terms of age, sex, and geographic dispersion. The enrollment was conducted between 2002 and 2022. A network comprising general practitioners and various medical personnel from private practices, hospitals, and recruitment offices of the Estonian Genome Center was established for participant recruitment, as well as for collection of samples and health data. After recruitment, participants were asked to fill out detailed questionnaires, which encompassed personal information, genealogical data, educational and occupational history, as well as lifestyle habits. Blood samples were generally collected non-fasting31. NMR metabolomic measurements were conducted on EDTA plasma samples (100 μL) for all biobank participants. The EBB database undergoes regular synchronization with several national registries and hospital databases, along with the national health insurance fund’s database that houses comprehensive treatment and service bill information. Disease events are codified in compliance with the ICD-10 standards and medication usage is categorized as per the Anatomical Therapeutic Chemical classification, both with current follow-up data available until the end of 2021. The Estonian Committee on Bioethics and Human Research approved the study. Data were accessed with research approval number 1.1-12/2770.

The Finnish THL biobank data consist of five population cohorts (National FINRISK Studies 1997, 2002, 2007, 2012 and Health 2000 Survey) collected in study-specific years between 1997 and 201232,33. Each of the five cohorts is an independent random sample of unique individuals aged 25–98 (25–74 in FINRISK, 30 and over in Health 2000). Recruitment was conducted by invitation only in multiple urban and rural areas across Finland to be representative of the nation (participation rate 60–70%).
The baseline surveys included a wide range of health-related questionnaires and biological measures, including a non-fasting blood sample (median 5 h since last meal) from ~85% of all participants enrolled. NMR metabolomic data were measured from all participants with blood samples available. In contrast to UK Biobank and EBB, NMR metabolomics measurements were done on serum samples (350 μL)8,32. Information on disease outcomes was linked from national hospital discharge registries and reimbursement records with follow-up until 2017 (4 to 19 years of follow-up). The THL biobank cohorts were approved by the Coordinating Ethical Committee of the Helsinki and Uusimaa Hospital District, Finland. Data were accessed with research application number BB2016_86.

In our main analysis, we included 12 diseases that are the top causes of disability-adjusted life years (DALYs) in the European region in 2019 according to the WHO, except falls and back pain34: myocardial infarction, ischemic stroke, intracerebral hemorrhage, lung cancer, type 2 diabetes, COPD, Alzheimer’s disease, vascular and other dementias, depressive disorders, alcoholic liver disease, cirrhosis of the liver, and colon and rectum cancers. Some WHO groupings were coarse (e.g., all dementias together), so we used narrower definitions in those cases to create more biologically homogenous outcomes aligned with common disease definitions used for PGS development. Instead of ischemic heart disease and diabetes mellitus, we used the more specific myocardial infarction and type 2 diabetes; stroke was split into ischemic stroke and intracerebral hemorrhage; Alzheimer’s disease was separated from vascular and other dementias; and alcoholic liver disease was separated from cirrhosis of the liver (Supplementary Data 2). We also included a model for the coarser ischemic heart disease (ICD-10 codes I20-25) endpoint, which we compared to the myocardial infarction model in a supplementary analysis.
Disease incidence was defined based on the first occurrence of ICD-10 codes listed in Supplementary Data 2. In the UK Biobank, we based disease incidence on primary care data (only available in ~45% of UK Biobank participants), hospital inpatient data, death register records, cancer registry data, and self-reports at baseline. In the EBB, we based disease incidence on self-reports at baseline, E-Health, North Estonia Medical Center, Tartu University Hospital, death registry records and cancer registry data. The Estonian Health Insurance Fund was excluded as a source because it appeared that diagnoses exclusively from that source were less severe. In the Finnish THL Biobank, we based disease incidence on the nation-wide hospital discharge registry (HILMO), cause-of-death registry, and for certain diseases, medication reimbursement registry. We considered disease cases occurring after the blood draw at study baseline as incident cases. Cases that occurred prior to the baseline blood draw were considered prevalent cases and were excluded from the analysis in a disease-specific manner.

Analyses in this paper were all carried out conditional on age and sex. Sex was defined differently in the different cohorts: in UK Biobank, sex at recruitment is taken from the patient’s medical record but can subsequently be edited by the participant if they choose. For the Finnish THL biobank, sex was taken from the population registries held by the Digital and Population Data Services Agency. In the EBB, sex was extracted from the Estonian National Identity number, which is created based on the sex recorded in the Estonian Birth Registry.

Written informed consent was obtained from all participants. Participants were not offered compensation for participating in this study.

Metabolomic biomarker profiling

Lipid and metabolite biomarkers were quantified from 757,927 blood samples by high-throughput NMR metabolomics (Nightingale Health Plc).
This number includes 23,080 blinded duplicate samples from the UK Biobank for quality control purposes. Additionally, each 96-well plate contained two internal control samples. We used the sample handling and measurement protocol established and validated for the first phase of metabolomics in UK Biobank8. The measurement protocol was similar for EBB and Finnish THL Biobank32. Briefly, EDTA samples of at least 90 μL were plated onto 96-well plates at UK Biobank laboratory (Stockport, UK) and shipped on dry ice in batches of 5000–20,000 samples to Nightingale Health Laboratories in Finland. Samples were provided to the Nightingale Health lab using randomly assigned pseudonymous sample IDs and were analysed in three tranches blinded to the clinical phenotype of the sample, with UK Biobank only revealing the linkage from pseudonymous sample IDs to true sample IDs after data generation for that tranche was complete. Samples were thawed overnight at +4 °C, mixed and centrifuged, transferred to NMR tubes and mixed in 1:1 ratio with a phosphate buffer (75 mM Na2HPO4 in 80%/20% H2O/D2O, pH 7.4, including also 0.08% sodium 3-(trimethylsilyl) propionate-2,2,3,3-d4 and 0.04% sodium azide). The samples were profiled using a total of nine 500 MHz spectrometers (Bruker AVANCE IIIHD). Two NMR spectra are recorded (a presaturated proton spectrum and a Carr–Purcell–Meiboom–Gill T2-relaxation-filtered spectrum) and proprietary software is used for biomarker quantification in absolute units (Nightingale Health, quantification library 2020)7,8. This provides 249 biomarker measures in a single assay (168 absolute and 81 ratio measures), including routine lipids, lipoprotein profiling of 14 size subclasses, fatty acids, and various low-molecular weight metabolites, such as amino acids, ketones, and glycolysis metabolites as well as two inflammatory protein measures, albumin, and glycoprotein acetyls.
For risk model training we used the 36 clinically validated biomarkers with CE mark in the NMR metabolomics assay, to facilitate rapid translation and clinical applications (Total cholesterol, VLDL cholesterol, Clinical LDL cholesterol, HDL cholesterol, Total triglycerides, Apolipoprotein B, Apolipoprotein A1, Ratio of apolipoprotein B to apolipoprotein A1, Total fatty acids, Omega-3 fatty acids, Omega-6 fatty acids, Polyunsaturated fatty acids, Monounsaturated fatty acids, Saturated fatty acids, Docosahexaenoic acid, Ratio of omega-3 fatty acids to total fatty acids, Ratio of omega-6 fatty acids to total fatty acids, Ratio of polyunsaturated fatty acids to total fatty acids, Ratio of monounsaturated fatty acids to total fatty acids, Ratio of saturated fatty acids to total fatty acids, Ratio of docosahexaenoic acid to total fatty acids, Ratio of polyunsaturated fatty acids to monounsaturated fatty acids, Ratio of omega-6 fatty acids to omega-3 fatty acids, Alanine, Glycine, Histidine, Total concentration of branched-chain amino acids, Isoleucine, Leucine, Valine, Phenylalanine, Tyrosine, Glucose, Creatinine, Albumin, Glycoprotein acetyls)8. To account for potential glucose degradation prior to plasma sample preparation, we used an estimate of physiological glucose concentration based on observed glucose and lactate as input for the risk models. Further, to correct for spectrometer differences in alanine concentration, the single metabolite most impacted by technical variation35, we rescaled the alanine concentrations observed within each spectrometer in each biobank to match the mean and standard deviation of a master spectrometer. The average biomarker detection rate was >99% across the plasma samples. Further details on the individual biomarker measures are provided in the UK Biobank data resource.

Genotype data and polygenic scores

For this study, genotype data were available for UK Biobank and EBB, but not for THL Biobank.
The UK Biobank participants have been genotyped on the Applied Biosystems UK Biobank Axiom Array and UK BiLEVE Axiom Array, measuring over 800,000 variants, and imputed using the Haplotype Reference Consortium, UK10K and 1000 Genomes reference panels outside of this study29. EBB participants have been genotyped with genome-wide chip arrays and further imputed with a population-specific imputation panel consisting of high-coverage (30-fold) whole-genome sequence data from 2244 individuals and over 16 million high-quality genetic variants36. For 10 of the 12 diseases, we used an existing PGS from the PGS Catalog37 that was developed using GWAS summary statistics that did not include the UK Biobank in their discovery cohort (Supplementary Data 2). These PGS were computed for UK Biobank participants as the weighted sum of risk alleles using imputed genotype data. We were also able to compute six of these PGS in EBB, though we note that a small number (~8000) of EBB samples were included in the GWAS underlying the PGS of diabetes and myocardial infarction. For UK Biobank, we estimated participants’ genetic ancestry with respect to the five superpopulations of the 1000 Genomes Project38 using principal component analysis projection and a random forest classifier, and scaled PGS with respect to their estimated ancestry. For EBB we assumed a more homogenous genetic background and scaled PGS within the cohort.

Clinical scores

A disease-specific clinical risk score was chosen for each endpoint to act as a comparator for benchmarking our risk scores. The risk scores were chosen such that they could be fully or partially calculated in UK Biobank participants using data available at baseline, and were taken from national risk assessment, screening and diagnosis guidance from NHS England and Wales where available, and from recommendations from other government or professional bodies where not available.
Two risk scores were modified to account for data not available in UK Biobank at baseline (AUDIT-C and COPD-PS), and one was modified to remove a question that showed informative missingness in our dataset (LLPv2). We did not identify a widely used clinical risk score for Alzheimer’s disease, so did not carry out benchmarking for this endpoint. Justifications for the choice of risk scores and details of their implementation are given in the Supplementary Methods document, and details, including code lists, of how each variable is derived from UK Biobank data are given in Supplementary Data 10. We defined high-risk individuals on the basis of the clinical scores as those who were in the top decile of risk, after adjusting for age and sex, in the training set. For some clinical scores (AUDIT-C, COPD-PS and PHQ-2) it was not possible to define a top decile of risk, as these scores take discrete values and therefore have exact ties that result in participants being on the border of high and low risk; in these cases we randomly assigned individuals with a borderline risk to the high-risk or low-risk category with fixed per-score probabilities, with these probabilities chosen to ensure that 10% of individuals were assigned as high risk in the training set. During model fitting, variables were transformed to ensure a close to normal distribution (log transformation for FLI, logit transformation for QRISK3, QDiabetes, QCancer and LLPv2).

We additionally assessed the performance of the two lung and two liver models in the context of smoking and drinking behavior. Lung cancer and COPD HR for the top decile versus the bottom 90% were evaluated separately for ever and never smokers, and adjusted for pack years of smoking. Liver cirrhosis and alcoholic liver disease HR were evaluated separately for ever and never drinkers, and adjusted for daily units of alcohol consumption. We also computed correlations between the disease scores, pack years and alcohol units.
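The randomized handling of ties for discrete clinical scores can be sketched as follows. This is a minimal illustration that assumes raw integer-valued scores and ignores the age and sex adjustment; the function names `tie_break_rule` and `assign_high_risk` are hypothetical and do not come from the study.

```python
import random
from collections import Counter


def tie_break_rule(train_scores, top_fraction=0.10):
    """Find the borderline score value and its high-risk probability.

    Scores above the returned threshold are always high risk. Scores equal
    to the threshold are high risk with probability p_border, chosen so that
    the expected high-risk share in the training set equals top_fraction.
    """
    target = top_fraction * len(train_scores)
    counts = Counter(train_scores)
    above = 0  # number of individuals strictly above the current value
    for value in sorted(counts, reverse=True):
        if above + counts[value] >= target:
            return value, (target - above) / counts[value]
        above += counts[value]
    raise ValueError("top_fraction must be in (0, 1]")


def assign_high_risk(scores, threshold, p_border, rng):
    """Flag high-risk individuals, breaking borderline ties at random."""
    return [
        s > threshold or (s == threshold and rng.random() < p_border)
        for s in scores
    ]
```

The threshold and probability are learned once in the training set and can then be reused unchanged in other cohorts, for example with `assign_high_risk(test_scores, threshold, p_border, random.Random(1))`, where `test_scores` is a hypothetical list of scores.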
Statistical analyses

Prior to epidemiological data analyses, we set any value of the metabolomic biomarkers or PGS to missing if the value was more than 4 standard deviations away from the mean. Samples with any missing information (metabolomic biomarkers, PGS or clinical variables) required for score training or prediction were excluded separately for each analysis. Considering the 36 CE-marked metabolomic biomarkers, 11,676 samples (2.4%) had at least one biomarker missing, and 16,614 samples (3.5%) were excluded due to outlier filtering. Additional analyses comparing performance within excluded samples showed that exclusions had little effect on the results (Fig. S15). We did not filter on genetic ancestry or ethnicity. Metabolomic biomarker measures were log1p-transformed, and all continuous variables were Z-normalized to have a mean of zero and a standard deviation of one in the training set. The means and standard deviations of the training set were subsequently used to scale the metabolomic biomarkers in the testing and replication cohorts.

We assigned all participants with a repeat measurement to the testing set (to increase sample size for that analysis), and then split the remaining UK Biobank data in half for training and testing, resulting in a total of 241,246 individuals to train the risk scores. We used 10 years of follow-up and Cox proportional hazards regression modeling with least absolute shrinkage and selection operator (Lasso) and tenfold cross-validation using the R package hdnom. We favored the parsimonious Lasso models for interpretability after observing no consistent pattern of better-performing prediction between Lasso and Elastic Net (Fig. S16).
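The preprocessing steps described above (outlier filtering, log1p transformation, and Z-normalization with training-set parameters) can be sketched for a single biomarker as follows. This is an illustrative sketch only: it assumes that the outlier limits are computed on the untransformed training values, which the text does not specify, and the names `fit_scaler` and `transform` are hypothetical.

```python
import math
from statistics import mean, stdev


def fit_scaler(train_values, max_sd=4.0):
    """Learn outlier limits and Z-score parameters from training values."""
    raw_mean, raw_sd = mean(train_values), stdev(train_values)
    # Drop training values more than max_sd standard deviations from the mean
    kept = [v for v in train_values if abs(v - raw_mean) <= max_sd * raw_sd]
    logged = [math.log1p(v) for v in kept]
    return {
        "raw_mean": raw_mean,
        "raw_sd": raw_sd,
        "max_sd": max_sd,
        "log_mean": mean(logged),
        "log_sd": stdev(logged),
    }


def transform(value, scaler):
    """Return the Z-scored log1p value, or None if the value is an outlier."""
    if abs(value - scaler["raw_mean"]) > scaler["max_sd"] * scaler["raw_sd"]:
        return None  # treated as missing
    return (math.log1p(value) - scaler["log_mean"]) / scaler["log_sd"]
```

Fitting the scaler once on the training set and applying it unchanged elsewhere mirrors the idea that testing and replication cohorts are scaled with training-set means and standard deviations rather than with their own.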
For each disease, we trained nine models, which all contained (1) age and sex, and additionally included (2) metabolomic measures, (3) a disease-specific PGS, (4) metabolomic measures and a disease-specific PGS, (5) a disease-specific clinical score, (6) a disease-specific clinical score and PGS, (7) a disease-specific clinical score and metabolomic measures, (8) a disease-specific clinical score, PGS and metabolomic measures, and (9) an extended set of metabolomic biomarkers. Age and sex were not penalized to ensure they were selected for each model and appropriately weighted. The extended set of metabolomic biomarkers contained all 249 available absolute and ratio measures. For the genetic ancestry analyses, we trained another set of models identical to the description above but leaving all participants with non-European genetic ancestry out from the training set to maximize the size of this group in the testing set.

We computed risk scores (for research use only) as the weighted sum of variables selected during training for each model. The variables that were selected for each model during training and their respective coefficients can be found in Supplementary Data 3. In computing the risk scores, we excluded the sex and age coefficients, which for a combined metabolomic and PGS model, for example, effectively results in risk scores composed of the weighted sum of 1–36 metabolomic measures and a PGS, which have been adjusted for age and sex. We evaluated correlations of the scores by computing Pearson correlation coefficients between different endpoints by model type, between duplicate measurements and between baseline and repeat assessments.
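The score computation and the correlation check described above can be sketched as follows. The variable names and coefficient values in any usage are placeholders, not values from Supplementary Data 3, and the helper names `risk_score` and `pearson` are hypothetical.

```python
import math


def risk_score(values, coefficients, unscored=("age", "sex")):
    """Weighted sum of the selected variables, leaving out age and sex terms.

    values: mapping of variable name to its (scaled) value for one person
    coefficients: mapping of variable name to its fitted model coefficient
    """
    return sum(
        weight * values[name]
        for name, weight in coefficients.items()
        if name not in unscored
    )


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Because the age and sex coefficients are skipped, two people with identical biomarker and PGS values receive the same score regardless of their age and sex, which is what allows the score to be reported as adjusted for those variables.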
We tested risk model performance in all three biobanks by assessing the association of these age- and sex-adjusted scores with disease incidence within the first 4 years after the blood draw (as the EBB has a large number of samples limited to this amount of follow-up), using Cox proportional hazards models and examining the hazard ratios (HR) with 95% confidence intervals between individuals in the top decile of the score and individuals in the bottom 90%. Fixed-effect meta-analysis of estimated hazard ratios across the three biobanks was carried out with the R package meta. We calculated the limits for the top deciles for each model in the training set once and subsequently used these to determine top decile classification in validation and replication cohorts (Supplementary Data 4). We assessed HR for 4 and 10 years of follow-up in the UK Biobank for PGS (also including results from the EBB) and disease-specific clinical risk score comparisons. We additionally stratified HR by the source of report of disease incidence (Fig. S17) and by whether the individual had primary care data available (Fig. S18). In addition, we computed HR per one standard deviation increment in the age- and sex-adjusted scores using Cox proportional hazards models. We compared these results to another version in which biomarker scaling and decile limits were not determined by the training set, but calculated within each separate cohort (Fig. S4).

For statistical significance, we considered p-values < 0.004, corresponding to a 95% confidence level Bonferroni corrected for 12 diseases (0.05/12 ≈ 0.004; refs. 39, 40). We estimated Kaplan–Meier curves stratified by PGS and metabolomic score deciles for 10 years of follow-up using the R package survival (ref. 41) for individuals in the top decile of the PGS score and the bottom 90% of the metabolomic score, the top decile of both the PGS and metabolomic scores, and the bottom 90% of the PGS score.
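Two of the steps above are simple enough to sketch directly: the Bonferroni threshold for 12 diseases, and top-decile classification using a limit fixed in the training set. The score distributions below are simulated and the function names are hypothetical; this is an illustration under those assumptions, not the study code.

```python
import numpy as np

ALPHA = 0.05
N_DISEASES = 12
bonferroni_threshold = ALPHA / N_DISEASES  # about 0.0042, reported as 0.004

def top_decile_limit(train_scores):
    """90th percentile of the training-set scores, computed once."""
    return float(np.quantile(train_scores, 0.9))

def is_top_decile(scores, limit):
    """Classify scores from any cohort against the fixed training-set limit."""
    return np.asarray(scores, dtype=float) > limit

# Simulated scores; the replication cohort is shifted slightly upwards
rng = np.random.default_rng(1)
train_scores = rng.normal(size=10_000)
replication_scores = rng.normal(loc=0.2, size=5_000)

limit = top_decile_limit(train_scores)
train_fraction = is_top_decile(train_scores, limit).mean()
replication_fraction = is_top_decile(replication_scores, limit).mean()
```

Because the limit is fixed in the training set, the share of individuals classified as "top decile" in another cohort need not equal 10%. That is the difference between this approach and the within-cohort alternative compared in Fig. S4.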
We used the cox.zph function and Schoenfeld residual plots to examine the proportional hazards assumption, which revealed that the HR was not constant over the 10-year follow-up time for some disease endpoints. Therefore, we assessed the continuous hazard over the HR in strata across the follow-up time for the metabolomics and PGS models. Using the R package rstpm2 (ref. 42), we built a generalized survival model with natural splines using 2 knots to allow for a time-varying effect, and additionally computed the hazard ratio for 1-year strata using the survSplit function in the survival R package (ref. 41). We also tested for interaction effects between the metabolomic scores and PGS. Areas under the receiver operating characteristic curve (AUC) were estimated utilizing absolute risks using the R package pROC (ref. 43). Net reclassification improvements (NRI) were computed comparing metabolomic and PGS scores to disease-specific clinical scores. Both continuous and categorical scores were calculated, with categorical scores using limits of the top 10% of the scores as cut-off thresholds. The R package nricens (ref. 44) was utilized in NRI computation.

For analyses examining two time points, we had 18,709 UKB participants with metabolomic scores calculated from both baseline and a repeat visit which took place after 2 to 7 years. We considered the eight diseases that had at least 100 cases within 10 years after the repeat visit (349 cases for COPD, 214 for colon cancer, 288 for depression, 439 for diabetes, 303 for myocardial infarction and 225 for ischemic stroke). We fitted Cox proportional hazards models for disease events 10 years after the repeat visit with age- and sex-adjusted metabolomic scores at baseline and repeat visit. Disease events between baseline and repeat visit were excluded.
To assess the risk changes between the baseline and repeat visit, we categorized participants into three groups: those who stayed in the highest decile of metabolomic score at both time points, those who were in the highest decile of metabolomic score at baseline but left the high-risk category at the repeat visit, and those in the bottom 90% of the metabolomic score at baseline. We tested for a difference between the two groups (stayers vs leavers) using Cox regression, including the baseline metabolomic score as a covariate to control for the fact that stayers had, on average, a higher baseline score than leavers. Additionally, we analyzed 5038 EBB participants with two separate blood metabolomics measurements taken on average five years apart, in the same way, for diabetes, since this was the only disease with sufficient events to assess replication.

We assessed clinical characteristics of high-risk individuals, defined as those with a metabolomics model score in the highest decile of at least one of the seven best-performing models: alcoholic liver disease, COPD, cirrhosis of the liver, ischemic stroke, lung cancer, myocardial infarction, and type 2 diabetes. In addition to basic clinical characteristics, we compared the frailty index, a measure that quantifies aging and health, between high- and low-risk individuals. We calculated the frailty index based on the method described by Williams et al. (ref. 20), using 49 self-reported disease outcomes in UK Biobank participants. Participants with at least 10 missing items were excluded. We tested the difference in means of clinical characteristics between high- and low-risk individuals using a two-sided t-test for continuous variables and a Chi-squared test for categorical variables.

To evaluate calibration of the metabolomic scores, we estimated observed and predicted incidence rates in all three biobanks over 3 years of follow-up.
We chose to censor at three years to obtain complete and comparable follow-up for as many samples as possible. We estimated calibration slopes and intercepts by fitting logistic regression of individual disease status (observed risk) on predicted risk (refs. 23, 24). We performed all statistical analyses and modeling in R version 4.3.2 (ref. 45).

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
For reasons of patient confidentiality and to ensure research is carried out in accordance with the terms of consent the data were collected under, data from the three biobanks are available under controlled access. The UK Biobank data are available for approved researchers through the UK Biobank data-access protocol (https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access). The data from the first ~280,000 UK Biobank participants are included in the Data Showcase (https://biobank.ndph.ox.ac.uk/showcase/label.cgi?id=220), and the full dataset is available on the Research Analysis Platform (https://www.ukbiobank.ac.uk/enable-your-research/research-analysis-platform). The average number of weeks from application submission to data release is 15 weeks for UK Biobank. Data from Estonian Biobank can be accessed through a research application to the Institute of Genomics of the University of Tartu (https://genomics.ut.ee/en/content/estonian-biobank). Data from FINRISK and Health 2000 cohorts can be accessed through a research application to THL Biobank (https://thl.fi/en/web/thl-biobank). Source data for Figs. 1–4 are provided with this paper.

Code availability
Code to reproduce the figures in this paper is available at: https://github.com/NightingaleHealth/ukb-nightingale-omics-prediction/.

References
1. Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 357, j2099 (2017).
2. Lambert, S. A., Abraham, G. & Inouye, M. Towards clinical utility of polygenic risk scores. Hum. Mol. Genet. 28, R133–R142 (2019).
3. Lewis, C. M. & Vassos, E. Polygenic risk scores: from research tools to clinical instruments. Genome Med. 12, 44 (2020).
4. Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
5. Mars, N. et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nat. Med. 26, 549–557 (2020).
6. Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults. J. Am. Coll. Cardiol. 72, 1883–1893 (2018).
7. Soininen, P., Kangas, A. J., Würtz, P., Suna, T. & Ala-Korpela, M. Quantitative serum nuclear magnetic resonance metabolomics in cardiovascular epidemiology and genetics. Circulation: Cardiovasc. Genet. 8, 192–206 (2015).
8. Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14, 604 (2023).
9. Würtz, P. et al. Quantitative serum nuclear magnetic resonance metabolomics in large-scale epidemiology: a primer on -omic technologies. Am. J. Epidemiol. 186, 1084–1096 (2017).
10. Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nat. Med. 28, 2309–2320 (2022).
11. Pietzner, M. et al. Plasma metabolites to profile pathways in noncommunicable disease multimorbidity. Nat. Med. 27, 471–479 (2021).
12. Morze, J. et al. Metabolomics and Type 2 Diabetes Risk: An Updated Systematic Review and Meta-analysis of Prospective Cohort Studies. Diab. Care 45, 1013–1024 (2022).
13. Deelen, J. et al. A metabolic profile of all-cause mortality risk identified in an observational study of 44,168 individuals. Nat. Commun. 10, 3346 (2019).
14. Lauber, C. et al. Lipidomic risk scores are independent of polygenic risk scores and can predict incidence of diabetes and cardiovascular disease in a large population cohort. PLoS Biol. 20, e3001561 (2022).
15. Walford, G. A. et al. Metabolite traits and genetic risk provide complementary information for the prediction of future type 2 diabetes. Diab. Care 37, 2508–2514 (2014).
16. Unterhuber, M. et al. Proteomics-enabled deep learning machine algorithms can enhance prediction of mortality. J. Am. Coll. Cardiol. 78, 1621–1631 (2021).
17. Godbole, S. et al. A metabolomic severity score for airflow obstruction and emphysema. Metabolites 12, 368 (2022).
18. Riveros-Mckay, F. et al. Integrated polygenic tool substantially enhances coronary artery disease prediction. Circ. Genom. Precis. Med. 14, e003304 (2021).
19. Thompson, D. J. et al. A systematic evaluation of the performance and properties of the UK Biobank Polygenic Risk Score (PRS) Release. PLoS One 19, e0307270 (2024).
20. Williams, D. M., Jylhävä, J., Pedersen, N. L. & Hägg, S. A frailty index for UK biobank participants. J. Gerontol. Ser. A 74, 582–587 (2019).
21. Mak, J. K. L. et al. Unraveling the metabolic underpinnings of frailty using multicohort observational and Mendelian randomization analyses. Aging Cell e13868 https://doi.org/10.1111/acel.13868 (2023).
22. Alba, A. C. et al. Discrimination and calibration of clinical prediction models: users’ guides to the medical literature. JAMA 318, 1377–1384 (2017).
23. Calster, B. V. et al. A calibration hierarchy for risk models was defined: from utopia to empirical data. J. Clin. Epidemiol. 74, 167–176 (2016).
24. Collins, G. S., Ogundimu, E. O. & Altman, D. G. Sample size considerations for the external validation of a multivariable prognostic model: a resampling study. Stat. Med. 35, 214–226 (2016).
25. Kullo, I. J. et al. Polygenic scores in biomedical research. Nat. Rev. Genet. 23, 524–532 (2022).
26. Ritchie, S. C. et al. Cardiovascular risk prediction using metabolomic biomarkers and polygenic risk scores: A cohort study and modelling analyses. Preprint at https://www.medrxiv.org/content/10.1101/2023.10.31.23297859v1 (2023).
27. Ko, D. T. et al. Calibration and discrimination of the Framingham risk score and the pooled cohort equations. CMAJ 192, E442–E449 (2020).
28. Gadd, D. A. et al. Blood protein assessment of leading incident diseases and mortality in the UK Biobank. Nat. Aging 4, 939–948 (2024).
29. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
30. Leitsalu, L. et al. Cohort profile: Estonian Biobank of the Estonian Genome Center, University of Tartu. Int. J. Epidemiol. 44, 1137–1147 (2015).
31. Fischer, K. et al. Biomarker profiling by nuclear magnetic resonance spectroscopy for the prediction of all-cause mortality: an observational study of 17,345 persons. PLoS Med. 11, e1001606 (2014).
32. Tikkanen, E. et al. Metabolic biomarker discovery for risk of peripheral artery disease compared with coronary artery disease: lipoprotein and metabolite profiling of 31 657 individuals from 5 prospective cohorts. J. Am. Heart Assoc. 10, e021995 (2021).
33. Borodulin, K. et al. Cohort profile: the National FINRISK Study. Int. J. Epidemiol. 47, 696–696i (2018).
34. World Health Organization. Global health estimates: Leading causes of DALYs. https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/global-health-estimates-leading-causes-of-dalys.
35. Ritchie, S. C. et al. Quality control and removal of technical variation of NMR metabolic biomarker data in ~120,000 UK Biobank participants. Sci. Data 10, 64 (2023).
36. Mitt, M. et al. Improved imputation accuracy of rare and low-frequency variants using population-specific high-coverage WGS-based imputation reference panel. Eur. J. Hum. Genet. 25, 869–876 (2017).
37. Lambert, S. A. et al. The Polygenic Score Catalog: new functionality and tools to enable FAIR research. Preprint at https://www.medrxiv.org/content/10.1101/2024.05.29.24307783v1 (2024).
38. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
39. Neyman, J. & Pearson, E. S. On the use and interpretation of certain test criteria for purposes of statistical inference: part I. Biometrika 20A, 175–240 (1928).
40. Dunn, O. J. Multiple comparisons among means. J. Am. Stat. Assoc. 56, 52–64 (1961).
41. Therneau, T. A Package for Survival Analysis in R. R package version 3.7-0, https://CRAN.R-project.org/package=survival (2024).
42. Liu, X.-R., Pawitan, Y. & Clements, M. Parametric and penalized generalized survival models. Stat. Methods Med. Res. 27, 1531–1546 (2018).
43. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinforma. 12, 77 (2011).
44. Inoue, E. nricens: NRI for risk prediction models with time to event and binary response data. https://doi.org/10.32614/CRAN.package.nricens
45. R Core Team. R: The R project for statistical computing. R foundation for statistical computing, https://www.R-project.org/ (Vienna, Austria, 2022).

Acknowledgements
We acknowledge the lab and spectrometry teams at Nightingale Health for their role in generating the metabolomic data on these cohorts. We are grateful to UK Biobank (Project 30418), Estonian Biobank, and THL Biobank (project BB2016_86) for access to data to undertake this study. We thank all biobank participants for their generous contribution to generating this resource for the scientific community. We acknowledge the Estonian Biobank Research Team of Mari Nelis, Georgi Hudjasov, Reedik Mägi, Andres Metspalu and Lili Milani.
Additionally, we thank Kristi Läll, Erik Abner and Kelli Lehto for their help with Estonian Biobank phenotype data. The work was funded by Nightingale Health Plc. Estonian Biobank was supported by Estonian Research Council grant PRG1291. TT was supported by Estonian Research Council grant PSG809. Author contributions Conceptualization: J.C.B., T.E., H.J., P.W. Data Curation: H.J., N.K., S.K., S.N.L., K.S. Formal Analysis: L.J.-D., H.J., N.K., S.K., S.N.L., K.S. Methodology: H.J. Investigation: J.C.B., L.J.-D., H.J., K.H., N.K., S.K., A.K., S.N.L., V.M., K.N., K.S., M.S., P.S., M.T., P.W. Resources: T.E., P.J., T.J., J.K., A.L., M.P., V.S., T.T. Supervision: J.C.B., H.J., P.W. Visualization: L.J.-D., H.J., S.K., S.N.L., K.S. Writing – original draft: J.C.B. Writing – review & editing: J.C.B., L.J.-D., H.J., S.K., S.N.L., N.K., K.S., P.W. Competing interests JCB, LJD, HJ, AK, NK, SK, HK, SL, VK, KN, KS, MS, PS, MT, and PW are employees of and hold shares or stock options in, Nightingale Health. The remaining authors declare no conflict of interest. Additional information Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41467-024-54357-0. Correspondence and requests for materials should be addressed to Jeffrey C. Barrett. Peer review information Nature Communications thanks Themistocles Assimes, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available. Reprints and permissions information is available at http://www.nature.com/reprints Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

© The Author(s) 2024

Nightingale Health Biobank Collaborative Group
Jeffrey C. Barrett (1), Tõnu Esko (2), Krista Fischer (2,3), Luke Jostins-Dean (1), Pekka Jousilahti (4), Heli Julkunen (1), Tuija Jääskeläinen (4), Antti Kangas (1), Nurlan Kerimov (1), Sini Kerminen (1), Anastassia Kolde (2,3), Harri Koskela (1), Jaanika Kronberg (2), Sara N. Lundgren (1), Annamari Lundqvist (4), Valtteri Mäkelä (1), Kristian Nybo (1), Markus Perola (4), Veikko Salomaa (4), Kirsten Schut (1), Maiju Soikkeli (1), Pasi Soininen (1), Mika Tiainen (1), Taavi Tillmann (5) & Peter Würtz (1)

(1) Nightingale Health, Helsinki, Finland. (2) Institute of Genomics, Faculty of Science and Technology, University of Tartu, Tartu, Estonia. (3) Institute of Mathematics and Statistics, Faculty of Science and Technology, University of Tartu, Tartu, Estonia. (4) Department of Public Health, Finnish Institute for Health and Welfare, Helsinki, Finland. (5) Institute of Family Medicine and Public Health, University of Tartu, Tartu, Estonia.
e-mail: jeffrey.barrett@nightingalehealth.com

Nature Communications | (2024)15:10092

Publication V

Heli Julkunen, Juho Rousu. Machine learning for comprehensive interaction modelling improves disease risk prediction in the UK Biobank. Submitted, July 2024. Available on medRxiv.

Machine learning for comprehensive interaction modelling improves disease risk prediction in the UK Biobank

Heli Julkunen (1,*) and Juho Rousu (1)
(1) Department of Computer Science, Aalto University, Espoo, Finland
(*) Corresponding author, heli.julkunen@aalto.fi

Abstract
Understanding how risk factors interact to jointly influence disease risk can provide insights into disease development and improve risk prediction. We introduce survivalFM, a machine learning extension to the widely used Cox proportional hazards model that incorporates estimation of all potential pairwise interaction effects on time-to-event outcomes. The method relies on learning a low-rank factorized approximation of the interaction effects, hence overcoming the computational and statistical limitations of fitting these terms in models involving many predictor variables. The resulting model is fully interpretable, providing access to the estimates of both individual effects and the approximated interactions. Comprehensive evaluation of survivalFM using the UK Biobank dataset across ten disease examples and a variety of clinical risk factors and omics data modalities shows improved discrimination and reclassification performance (65% and 97.5% of the scenarios tested, respectively). Considering a clinical scenario of cardiovascular risk prediction using predictors from the established QRISK3 model, we further show that the comprehensive interaction modelling adds predictive value beyond the individual and age interaction effects currently included.
These results demonstrate that comprehensive modelling of interactions can facilitate advanced insights into disease development and improve risk predictions.

Introduction
Risk prediction models are needed in modern preventive medicine to identify individuals at high risk of disease before clinical symptoms manifest. The ability to predict disease risk is particularly important in managing complex diseases, such as cardiovascular disease, chronic kidney disease, and diabetes, where early intervention can substantially alter patient outcomes. However, accurately predicting disease risk is challenging due to the inherent complexity of most human diseases, which arise from the interplay of genetic, environmental, and lifestyle factors. Traditional methods in survival analysis, such as the widely used Cox proportional hazards regression [1], assume linear effects of predictor variables on time-to-event outcomes. This assumption may lead to oversimplified models that overlook the complex interplay among predictors, potentially missing important biological insights and limiting risk prediction accuracy.

The accuracy of time-to-event prediction models can be improved by incorporating interaction terms, a well-established concept in epidemiology to assess the joint effects of predictors on outcomes [2, 3]. For instance, interaction terms have been shown to be relevant in cardiovascular disease (CVD) risk prediction, where the effects of other risk factors can vary depending on age [4, 5, 6, 7]. However, incorporating these terms in multivariable prediction models typically requires prior hypotheses about which interactions to include. As the number of potential interaction terms increases quadratically with the number of predictor variables in consideration, inclusion of all potential interactions quickly becomes impractical without targeted hypotheses to guide the selection.
Therefore, prior multivariable prediction models have typically been constrained to a restricted set of interaction terms known to alter outcome associations, such as those involving age. This limits the discovery of new, potentially relevant interactions. Another commonly employed strategy is to perform statistical testing of individual interaction terms, but this can miss interactions that only become relevant for prediction in the presence of other variables. This challenge becomes particularly pronounced with modern biomedical datasets, which can contain hundreds of potential predictors. While machine learning survival analysis extensions like random survival forests [8] and deep survival models [9, 10] can capture complex non-linearities and interactions in the underlying data, they often compromise interpretability, which is crucial when the goal is to inform clinical decision-making or to obtain insights into the risk factors underlying disease development. To enhance possibilities to understand and model the joint effects of risk factors on time-to-event disease outcomes, we here present survivalFM, a methodological extension to the Cox proportional hazards model that incorporates estimation of all potential pairwise interaction effects among predictor variables. The method is based on an efficient strategy of learning the interaction effects using a low-rank factorized approximation, a concept taken from factorization machines (FMs) [11] and here applied to survival analysis. survivalFM combines the factorization of the interaction effects with an efficient quasi-Newton optimization algorithm, thereby overcoming the computational and statistical challenges of fitting comprehensive interaction effects in time-to-event prediction models involving many variables. The resulting model is fully interpretable, providing access to the estimates of both individual effects and the approximated interactions. 
We demonstrate the performance of survivalFM across various data modalities and disease outcomes using data from the UK Biobank. We further highlight an application in a clinical cardiovascular risk prediction scenario and show that survivalFM can learn predictive interaction effects which improve identification of high-risk individuals. While we highlight applications in disease risk prediction, the method is generally applicable to modelling any type of time-to-event outcomes.

Results

Overview of survivalFM
Figure 1 presents an overview of survivalFM. We developed survivalFM to estimate all potential pairwise interaction effects among input variables for right-censored survival data, such as time to disease onset. It is based on the widely used proportional hazards model [1], which relates the time until an event occurs to a set of predictor variables through a hazard function of the form:

$$h(t \mid x) = h_0(t)\,\exp(f(x)) \tag{1}$$

where $h(t \mid x)$ represents the hazard for an individual at time point $t$, with the baseline hazard function $h_0(t)$ describing the time-varying hazard and the partial hazard $\exp(f(x))$ quantifying the impact of the predictor variables $x$ on the baseline hazard. In the standard formulation of the Cox proportional hazards model, the partial hazard $\exp(f(x))$ is assumed to be parametrized by a linear combination of the predictor variables, $f(x) = \beta^{\top} x$, with $\beta$ giving the weights for the individual variables. In many applications, understanding how variables may interact to jointly impact the hazard rate can provide additional value beyond their independent linear effects. However, directly fitting all potential pairwise interaction terms in a multivariable prediction model quickly becomes challenging due to the quadratic increase in the number of interaction terms as a function of the number of input variables.
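The scale of the problem can be made concrete with a short sketch. It counts the distinct pairwise interaction terms for d predictors, d(d-1)/2, and compares this with the d·k parameters of a rank-k factorization of the kind the method relies on, in which each predictor i is given a k-dimensional latent vector p_i and the weight of an interaction is the inner product of p_i and p_j. It also checks numerically that the factorized interaction sum, taken here over all ordered pairs i ≠ j, can be evaluated without looping over pairs. The values of d and k, the function names, and the random inputs are illustrative assumptions, not settings or code from the study.

```python
import numpy as np

def n_pairwise_terms(d):
    """Number of distinct pairwise interaction terms among d predictors."""
    return d * (d - 1) // 2

def n_factorized_params(d, k):
    """Number of parameters in a d x k latent factor matrix."""
    return d * k

def interactions_naive(x, P):
    """Sum of <p_i, p_j> x_i x_j over all ordered pairs i != j, in O(d^2 k)."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                total += float(P[i] @ P[j]) * x[i] * x[j]
    return total

def interactions_factorized(x, P):
    """Same sum in O(d k): ||P^T x||^2 minus the i == j terms."""
    s = P.T @ x                                   # shape (k,)
    diag = np.sum((P ** 2).sum(axis=1) * x ** 2)  # sum_i ||p_i||^2 x_i^2
    return float(s @ s - diag)

# Hypothetical sizes: 200 predictors, rank-10 factorization
d, k = 200, 10
rng = np.random.default_rng(0)
x = rng.normal(size=d)
P = rng.normal(scale=0.1, size=(d, k))

naive = interactions_naive(x, P)
fast = interactions_factorized(x, P)
direct_terms = n_pairwise_terms(d)          # grows quadratically in d
latent_params = n_factorized_params(d, k)   # grows linearly in d
```

If the interaction sum is instead taken over unordered pairs i < j, the same identity applies with the result halved.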
Hence, we propose survivalFM, an extension which adds an approximation of all pairwise interaction effects using a factorized parametrization approach (Figure 1a-b):

$$f(x) = \beta^{\top} x + \sum_{1 \le i \ne j \le d} \tilde{\beta}_{i,j}\, x_i x_j = \beta^{\top} x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle\, x_i x_j \tag{2}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product and $d$ denotes the number of predictor variables. The first part contains the linear effects of all predictor variables in the same way as in the standard formulation of the Cox proportional hazards model. The second part contains all pairwise interaction effects between the predictor variables $x_i$ and $x_j$. However, instead of directly estimating the interaction effects $\beta_{i,j}$, the factorized parametrization approximates the effects using an inner product between two low-rank latent vectors, $\tilde{\beta}_{i,j} = \langle p_i, p_j \rangle$. The parameter vectors $p_i \in \mathbb{R}^{k}$ and $p_j \in \mathbb{R}^{k}$ are the row vectors of a low-rank parameter matrix $P \in \mathbb{R}^{d \times k}$ (Figure 1a). Hence, this results in far fewer parameters to estimate, as the rank of the factorization is typically much lower than the total number of predictor variables ($k \ll d$). With this approach, we avoid the statistical and computational problems that would be encountered with direct estimation of all interaction terms in the presence of many predictor variables, while still maintaining interpretability. The idea of using a factorized parametrization strategy originates from factorization machines (FMs) [11], originally proposed for regression and classification tasks in the context of recommender systems. For more details of the model and the fitting procedure, see Methods.

Study population, disease outcomes and data modalities
To evaluate whether survivalFM could improve risk prediction models and provide new insights on the joint effects of risk factors on disease onset, we performed analyses using data from the UK Biobank. This cohort comprises a total of approximately 500,000 participants from the UK, enrolled in 21 recruitment centers across the country.
The UK Biobank is renowned for its comprehensive phenotyping and molecular profiling, including routine blood biomarkers and advanced omics measurements such as genomics and metabolomics. Baseline characteristics of the study population and a summary of the datasets studied here are provided in Supplementary Table 1. As disease outcomes, we considered the 10-year incidence of ten example diseases, selected to comprise common diseases and diseases which can benefit from intervention if identified early (Supplementary Tables 2 and 3), excluding participants with a prior record of the disease at baseline.

To assess the performance across different data modalities, we considered four different prediction scenarios that incorporate an array of predictors ranging from traditional clinical predictors to more advanced omics-based data sources (Figure 1c, Methods). In the first scenario, we started from a set of standard cardiovascular risk factors included in the ASCVD Risk Estimator Plus [12], widely recognized in various primary prevention scores. Since these factors have been shown to be predictive beyond cardiovascular diseases [13, 14, 15], we included them as standard risk factors across all analyzed disease examples. We then added sets of more complex data layers to these standard risk factors (Figure 1c). In the second scenario, we added a comprehensive set of hematologic and clinical biochemistry measures to the standard risk factors; in the third scenario, we incorporated a wide range of metabolomic biomarkers, which have recently shown promise as an assay to inform on multidisease risk [13, 16]; and finally, we included a set of polygenic risk scores for both disease and quantitative traits [17], which have gained interest for their potential to enhance risk prediction models by providing complementary information to traditional risk factors [18, 19, 20].
survivalFM improves risk prediction across various diseases and data modalities
The practical utility of any risk prediction model is determined by its ability to stratify risk and identify high-risk individuals. We evaluated the ability of survivalFM to predict future disease risk and benefit from the comprehensive interaction terms by comparing its performance to standard linear Cox proportional hazards regression (Figure 1b), employing L2 (Ridge) regularization in both methods to control model complexity and prevent overfitting (Methods). The performance of the models was evaluated in 10-fold cross-validation, using a 20% validation set within each cross-validation cycle to optimize regularization parameters. Analyses were consistently applied across the same sets of predictor variables and fixed cross-validation folds.

By modelling the comprehensive interactions present in the underlying data, survivalFM improved the discriminatory performance across a majority of the studied examples as measured by concordance index (C-index; Figure 2). Specifically, statistically significant improvements were noted in 26 of the 40 evaluated scenarios (65%), with a mean improvement in concordance index (ΔC-index) of 0.005. Minor improvements were noted in another 12 of the 40 scenarios (30%; mean ΔC-index 0.001). Importantly, none of the studied examples demonstrated a statistically significant decrease in performance with survivalFM, highlighting the robustness of the method. Absolute values for the C-indices are detailed in Supplementary Table 4, demonstrating good discriminative performance across all models with C-indices in the range 0.72–0.93. Moreover, all models were well calibrated across the UK Biobank cohort (Supplementary Figures 2-5).
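For readers unfamiliar with the concordance index, a minimal sketch of Harrell's estimator for right-censored data follows. Whether the study used exactly this estimator is an assumption; the function name and toy data are hypothetical, and tied event times are simply treated as not comparable.

```python
def concordance_index(time, event, risk):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event and
    a shorter follow-up time than subject j. Ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: the third subject is censored at t = 6
time = [2.0, 4.0, 6.0, 8.0]
event = [1, 1, 0, 1]
risk_good = [0.9, 0.7, 0.3, 0.1]   # higher risk for earlier events
risk_bad = [0.1, 0.3, 0.7, 0.9]    # reversed ordering

c_good = concordance_index(time, event, risk_good)
c_bad = concordance_index(time, event, risk_bad)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why differences of a few thousandths between already strong models, such as those reported above, are hard to interpret on their own.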
Given that even modest improvements in the C-index at the population level can substantially affect individual risk predictions, we also evaluated the model performance using continuous net reclassification improvement (NRI), which has been shown to provide complementary information on risk model performance [21, 22]. The continuous NRI quantifies the extent to which the model appropriately increases the predicted probabilities for subjects who experience events and decreases them for those who do not. This metric is particularly useful in the absence of established clinical thresholds for high-risk groups, as it quantifies the improvement in risk prediction without relying on predefined risk cutoffs and thus facilitates comparisons across different diseases.

In terms of the continuous NRI, survivalFM yielded significantly improved reclassification in 39 of the 40 studied examples (97.5%), with a mean continuous NRI of 37%. Therefore, despite the relatively modest improvement magnitudes in the C-indices, the continuous NRI indicated notable positive changes in individual risk predictions. For instance, type 2 diabetes modelled using clinical biochemistry and blood counts data demonstrated the highest continuous net reclassification improvement of 94% (95% CI 92%-96%), corresponding to 33% (95% CI 32%-35%) of events and 61% (95% CI 60%-61%) of non-events having improved risk estimates (Supplementary Figure 6). Similarly, chronic liver disease models demonstrated notable improvements across all data modalities, with continuous net reclassification improvements ranging between 17% and 88%. These findings suggest that the interaction terms carry additional predictive information across various diseases and data modalities, and that survivalFM can model this residual contribution. While the extent of improvement varied depending on the specific disease and dataset under study, improvements were consistently observed across multiple disease areas and data types.
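The continuous NRI is the sum of an event component and a non-event component, which is why the type 2 diabetes figures of 33% and 61% add up to the reported 94%. The sketch below shows this decomposition for a binary outcome. It ignores censoring, which is a simplifying assumption for illustration (the Methods state that the study used the nricens R package on time-to-event data), and the function name and toy numbers are hypothetical.

```python
def continuous_nri(events, risk_old, risk_new):
    """Category-free NRI for a binary outcome (censoring ignored).

    Returns (event component, non-event component, total) as fractions.
    Events should move up in predicted risk, non-events should move down.
    """
    up_e = down_e = n_e = 0
    up_ne = down_ne = n_ne = 0
    for ev, old, new in zip(events, risk_old, risk_new):
        if ev:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_ne += 1
            up_ne += new > old
            down_ne += new < old
    nri_events = (up_e - down_e) / n_e
    nri_nonevents = (down_ne - up_ne) / n_ne
    return nri_events, nri_nonevents, nri_events + nri_nonevents


# Hypothetical predicted risks from a reference model and an updated model.
events = [1, 1, 1, 0, 0, 0, 0]
risk_old = [0.20, 0.30, 0.10, 0.20, 0.30, 0.10, 0.40]
risk_new = [0.30, 0.40, 0.05, 0.10, 0.20, 0.15, 0.40]

print(continuous_nri(events, risk_old, risk_new))
```

Because each component ranges from -100% to 100%, the total ranges from -200% to 200%, so a value such as 94% is large but well within the metric's range.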
Disease-specific interaction profiles

A key advantage of survivalFM is that despite introducing a more complex layer of non-linearity through the interaction terms, it still maintains interpretability and transparency of how the model predictions are made. Analysis of the estimated interaction effects revealed that in many cases there was a diverse interaction landscape contributing to these predictions, demonstrating that the observed performance gains are likely to stem from the cumulative benefit of many small interaction effects rather than a few prominent ones (examples shown in Supplementary Figures 7-11). Here, we will highlight a few examples with the most notable performance gains.

Inclusion of interaction terms was particularly advantageous in liver-related conditions, such as when predicting alcoholic liver disease or liver fibrosis and cirrhosis using standard risk factors or metabolomic biomarkers. In both liver disease models derived using standard risk factors, among the most prominent interactions were those among different cholesterol measures, cholesterol-lowering medication, and sex (Supplementary Figures 7-8). These results suggest that the joint effects of these risk factors further explain the risk of chronic liver disease outcomes beyond their additive linear effects. The model for alcoholic liver disease also highlighted interactions with white ethnic background, suggesting variation in risk factor profile by ethnicity. Additionally, smoking status was highly weighted both individually and in the interactions, aligning with earlier research suggesting that smoking may exacerbate the influence of the other risk factors in the development of chronic liver diseases [23]. In the case of liver disease models derived using metabolomics biomarkers, both the alcoholic liver disease model and the liver fibrosis and cirrhosis model assigned high weights to interactions among various amino acids along with their individual effects (Supplementary Figures 9-10).
These observations align with previously reported changes in amino acid metabolism related to chronic liver diseases [24, 25], with these results suggesting that the associations of amino acids with the chronic liver disease outcomes are also characterised by complex joint effects. Furthermore, both models emphasized a strong interaction between acetate and glutamine, with acetate having a notably pronounced interaction profile in the model for alcoholic liver disease. Given the known roles of acetate and glutamine in alcohol metabolism and lipid accumulation in the liver [26, 27], these findings suggest that the joint presence of high levels of both metabolites indicates an even higher risk of chronic liver diseases.

A contrasting example was type 2 diabetes modelled using clinical biochemistry and blood counts data, which obtained the highest observed continuous NRI. Unlike the other examples, analysis of the model coefficients revealed that the model weights were predominantly concentrated around glycated hemoglobin (HbA1c) and its interactions with the other variables (Supplementary Figure 11). The highest interaction weight was attributed to the interaction between HbA1c and glucose, which was negatively weighted despite their positive individual effects. This likely reflects the fact that the simultaneous elevation of both HbA1c and glucose does not increase risk additively but rather relates to them being correlated measures of blood glucose regulation and overall glycemic control. Additionally, the model highlighted positively weighted interactions of HbA1c with age, white ethnicity, and urate levels, indicating these factors together might amplify the risk. In contrast, interactions between HbA1c and reticulocyte count and body mass index were negatively weighted.
survivalFM benefits from large training data sizes

To understand the impact of training data size on model performance and the ability of survivalFM to leverage interaction terms, we conducted analyses with models trained on varying-sized subsets of the training data. Throughout these analyses, the test and validation sets were held fixed, allowing us to analyze how changes only in the number of training individuals influence model performance. Figure 3 shows the discriminatory performance of survivalFM as a function of the number of training individuals for the input dataset involving standard risk factors (results for the other predictor sets are shown in Supplementary Figures 12-14). These results demonstrate a clear dependency on large sample sizes to uncover predictive interaction terms, with survivalFM generally requiring at least 50,000 individuals in training to outperform standard Cox regression. The discriminatory performance of survivalFM shows a positive trend and an increasing gap to standard Cox regression with increasing sample sizes, although the gains often begin to plateau at the upper end of the sample size range.

survivalFM improves prediction performance in a clinical cardiovascular risk prediction scenario

To explore whether comprehensive interaction modeling via survivalFM could also refine well-established clinical risk prediction models, we conducted analyses in a clinical CVD risk prediction setting using predictors from the QRISK3 model [5]. QRISK models are Cox proportional hazards models used for predicting a patient’s 10-year risk of CVD, recommended by the healthcare guidelines in the UK. The latest version, QRISK3 from 2017 [5], incorporates a variety of risk factors and comorbidities, along with a set of their interaction terms with age. We aimed to determine if comprehensive modelling of the interaction terms among the QRISK3 risk factors using survivalFM could improve the model’s ability to predict cardiovascular risk.
The endpoint was defined as 10-year incidence of composite CVD, including coronary heart disease, ischemic stroke, and transient ischemic attack, and including both fatal and non-fatal events (Supplementary Table 5, Methods). Following the exclusion criteria from the QRISK3 derivation study, we excluded participants with prior CVD diagnoses and those on cholesterol-lowering medication at study entry. The baseline characteristics of the study population in this clinical prediction scenario are detailed in Supplementary Table 6.

To ensure a fair comparison of the models, we retrained the QRISK3 model in the UK Biobank considering the same set of risk factors (Methods). As prior research has shown QRISK3 to systematically overestimate CVD risk in the UK Biobank [28], retraining the model ensures an accurate calibration for this cohort. We evaluated three models of increasing complexity: 1) a standard Cox regression model with linear terms only, 2) a Cox regression model incorporating linear terms and age interaction terms from the QRISK3 model, and 3) a survivalFM model including linear terms and all potential factorized pairwise interaction terms.

In terms of discrimination performance measured by C-index, survivalFM showed statistically significant improvements over the compared models (Figure 4a, Supplementary Table 7). Specifically, it improved the discrimination performance by 0.0018 (95% CI 0.0013-0.0023) over the standard Cox model with linear terms only, and by 0.0014 (95% CI 0.0010-0.0019) over the Cox model that also includes the current age interaction terms from QRISK3. Notably, the inclusion of age interaction terms from QRISK3 improved the discrimination performance by 0.0004 (95% CI 0.0000-0.0008) over the model with linear terms only. Hence, modelling the comprehensive interactions using survivalFM yielded a gain in discrimination more than four times that obtained by incorporating only the currently included age interactions.
To further assess how well the models reclassified individuals into appropriate risk categories, we computed categorical net reclassification improvements (NRI) at the guideline-recommended 10% absolute risk threshold [29]. Incorporating the currently included age interaction terms from QRISK3 resulted in an overall NRI of 0.66% (95% CI 0.40%-0.93%) compared to the model with linear terms only (Figure 4b). The results for survivalFM showed a greater overall NRI of 1.47% (95% CI 1.12%-1.82%), again demonstrating further gains beyond the currently included age interaction terms. survivalFM accurately reclassified 3.18% of individuals who experienced an event into the high-risk category, while it inappropriately reclassified a smaller portion of 1.71% of non-events as high-risk (Supplementary Table 7). These improvements are also visible in the reclassification plots (Figure 4c) showing how the individual predictions change with the inclusion of new model terms. All models were well calibrated (Supplementary Figure 15a) and exhibited broadly similar distributions across the risk spectrum (Supplementary Figure 15b).

Analysis of the model coefficients from survivalFM revealed a broad array of interactions contributing to the CVD predictions. The ratio of total cholesterol to HDL cholesterol demonstrated the most pronounced interaction profile among all predictor variables (Figure 5). This suggests that the effect of the cholesterol ratio on CVD risk is influenced by the presence of other risk factors. For example, the interaction weight for the cholesterol ratio with prevalent atrial fibrillation was negative, despite both factors having positive individual weights. This suggests that these variables capture partly overlapping aspects of cardiovascular risk. Atrial fibrillation is often associated with a broader cardiovascular risk [30], which could already be reflected in the elevated cholesterol ratio.
This may thus imply that when both risk factors are present, they do not independently add to the risk. Comparing the estimated effects for the model terms overlapping between survivalFM and the standard Cox regression model with linear and age interaction terms from QRISK3, the shared terms exhibited very similar weights, with a correlation of 0.97 between the effects estimated by the two methods (Supplementary Figure 16). This shows that despite the introduction of complex interactions, the fundamental risk associations remain broadly consistent.

Discussion

Accurate prediction of disease onset and prognosis is essential to realize preventative medicine. In this study, we have introduced survivalFM, a new machine learning method for multivariable time-to-event prediction. The method extends the widely used Cox proportional hazards regression by estimating all potential pairwise interaction effects among predictor variables on time-to-event outcomes, such as disease onset. We have shown that estimating these comprehensive interaction effects improves risk prediction and refines individual risk predictions across a range of common diseases, providing more nuanced insights into the interplay among factors underlying disease risk. Since survivalFM generalizes to other use cases, we expect this method to find applications in precision medicine and benefit survival modelling in large studies involving many predictors.

Our results from UK Biobank revealed that survivalFM can identify predictive interaction terms, which are missed when using standard Cox proportional hazards regression. This ability to uncover predictive interaction terms extended across various disease outcomes and data modalities. Importantly, survivalFM consistently matched or surpassed the performance of the standard Cox regression model.
This robustness is by design, as survivalFM separates linear effects from interaction effects and, by appropriate tuning of model hyperparameters, can assign negligible weight to non-contributory interaction effects while emphasizing predictive ones. These findings highlight the utility of survivalFM in refining risk prediction models across various prediction scenarios, including models derived from traditional clinical predictors and modern omics data types.

Our results further showed that survivalFM can add predictive value in practical clinical risk prediction scenarios, such as in CVD risk prediction using predictors from the established QRISK3 model. CVD remains the leading cause of mortality worldwide [31], making accurate risk stratification critical for healthcare providers to allocate preventive measures effectively. Applying survivalFM to QRISK3 risk factors improved both discrimination and reclassification at the clinically recommended 10% risk threshold, more than doubling the performance gains obtained from the current model’s age-related interaction terms alone. For context, while a recent study [18] reported a 1.3% net reclassification improvement by adding a polygenic risk score to a CVD prediction model in a similar scenario involving QRISK3 risk factors, survivalFM achieved a comparable improvement by optimizing the use of existing clinical variables.

A key strength of survivalFM is that despite introducing non-linearity through the comprehensive interaction terms, it maintains interpretability by providing the estimated effects for both the individual terms and the approximated interactions. This is unlike many other advanced machine learning techniques, which often lack transparency. Another advantage of survivalFM is a straightforward training process, which only involves optimizing the regularization parameters and setting the rank for factorizing the interaction parameters.
We anticipate the accompanying R package will facilitate rapid adoption of the method in other prediction studies.

Interpretation of the trained models suggested that in many cases numerous small interaction effects collectively enhanced the prediction accuracy, highlighting the importance of modeling the entire interaction landscape. However, we also showed that capturing these interaction effects generally requires a large sample size. This can limit the method’s applicability in smaller cohorts. Therefore, future studies in adequately powered cohorts are needed to assess the consistency of the identified interactions and gains in prediction accuracy across diverse populations. Whilst a large sample size is needed, many biobank initiatives are emerging with clinical and omics data at scale. Our results indicate such initiatives could be used as a base for discovering and replicating comprehensive risk factor interactions that are missed by conventional statistical methods.

The generalizable nature of survivalFM makes it applicable also to data modalities other than those highlighted in this paper. For instance, comprehensive modelling of interactions across omics data modalities could provide valuable insights into the molecular interplay behind disease risk. Another use case could be studies of protein interaction patterns in relation to disease onset. Recent studies in UK Biobank have demonstrated the strong promise of proteomics data in predicting various diseases [32, 33, 34]. Given that proteomics data in UK Biobank comprises around 3 000 measured proteins, the number of potential interactions is in the millions.
While the current sample size of 50 000 with proteomics data in UK Biobank is at the lower limit for comprehensive interaction modelling, with a sufficient sample size, survivalFM could be used to uncover protein interactions predictive of disease onset and potentially provide further insights for personalized treatment strategies.

In conclusion, survivalFM provides an advancement in survival analysis, enhancing disease risk prediction by effectively incorporating comprehensive interaction terms. Our findings provide a foundation for future research and translation of risk prediction models, emphasizing the importance of interaction effects in understanding disease development and refining risk prediction models.

Methods

survivalFM: Extending the proportional hazards model with factorized interaction terms

Survival data

Throughout this paper, we assume right-censored survival data. This means that the outcome consists of two variables: the event of interest (here, disease onset) and the time from the beginning of the study period until either the occurrence of the event, patient loss to follow-up, or the end of follow-up (i.e. right censoring). The survival dataset $D$ consists of tuples $D = \{(x_i, t_i, \delta_i)\}_{i=1}^{N}$, where $x_i$ represents a vector of predictor variables for the individual $i$, $t_i$ marks the observed time to the event of interest or to the point of censoring, and $\delta_i$ is an indicator which denotes whether $t_i$ corresponds to an actual event occurrence ($\delta_i = 1$) or a censored observation ($\delta_i = 0$).

Model formulation

We base survivalFM on the widely used proportional hazards model [1], which relates the time until an event occurs to a set of predictor variables through a hazard function of the form:

$$h(t \mid x) = h_0(t) \exp(f(x)) \tag{3}$$

where $h_0(t)$ is a shared baseline hazard function that varies over time, and $\exp(f(x))$ is a partial hazard that describes the effects of the predictor variables on the baseline hazard.
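As a short worked consequence of eq. (3), the ratio of the hazards of two individuals does not depend on time or on the baseline hazard, which is where the name "proportional hazards" comes from. The labels $x_a$ and $x_b$ below are hypothetical and introduced only for this illustration.

```latex
\frac{h(t \mid x_a)}{h(t \mid x_b)}
  = \frac{h_0(t)\,\exp(f(x_a))}{h_0(t)\,\exp(f(x_b))}
  = \exp\bigl(f(x_a) - f(x_b)\bigr)
```

The same cancellation of $h_0(t)$ is what makes it possible to estimate the parameters of $f$ without specifying the baseline hazard.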
In the standard formulation of the Cox proportional hazards model, the partial hazard $\exp(f(x))$ is assumed to be parametrized by a linear combination of the variables of the individual, $f(x) = \beta^\top x$, with $\beta$ representing the coefficients or parameters of the model assigning weights to the individual variables $x_i$. In this study, in addition to the individual effects of the variables, we propose to add an approximation of all pairwise interaction terms using a factorized parametrization of the coefficients, following the approach originally introduced along with factorization machines [11] in the context of recommender systems:

$$f(x) = \beta^\top x + \sum_{1 \le i \ne j \le d} \tilde{\beta}_{i,j}\, x_i x_j = \beta^\top x + \sum_{1 \le i \ne j \le d} \langle p_i, p_j \rangle\, x_i x_j \tag{4}$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product. The first part contains the linear effects of the predictor variables in the same way as in the standard formulation of the Cox proportional hazards model. The second part contains all pairwise interactions between the predictor variables $x_i$ and $x_j$. However, instead of directly estimating the interaction weights $\beta_{i,j}$, the factorized parametrization approximates the coefficients using an inner product between two latent vectors, $\tilde{\beta}_{i,j} = \langle p_i, p_j \rangle$. The low-rank factor vectors $p_i \in \mathbb{R}^k$ and $p_j \in \mathbb{R}^k$ are collected into a parameter matrix $P \in \mathbb{R}^{d \times k}$ (Figure 1a). Rank $k$ is a hyperparameter that defines the dimensionality of the factor vectors, and usually the optimal rank of the factorization is much lower than the number of input predictors ($k \ll d$).

Parameter estimation

Following the standard Cox proportional hazards regression, we estimate the model parameters $\theta$ using a partial likelihood function $L(\theta \mid D)$. For each individual who experiences an event at time $t$, their likelihood contribution is the ratio of the hazard of that individual to the summed hazards of all individuals at risk at the same time point; these contributions are multiplied across all individuals with an event.
Formally, this can be expressed as follows:

$$L(\theta \mid D) = \prod_{i:\delta_i=1} \frac{h_0(t_i) \exp(f(x_i))}{\sum_{j \in R(t_i)} h_0(t_i) \exp(f(x_j))} = \prod_{i:\delta_i=1} \frac{\exp(f(x_i))}{\sum_{j \in R(t_i)} \exp(f(x_j))} \tag{5}$$

where $x_i$ denotes the vector of predictor variables for an individual $i$, $t_i$ is the observed event time for individual $i$ and $R(t_i)$ denotes the risk set at time $t_i$. Being in the risk set means that the individual has not yet had an event and has not yet been censored. Here, $f(x)$ corresponds to the log-risk function from eq. (4) containing the individual effects and all pairwise interaction terms in a factorized form. As the baseline hazard function $h_0(t)$ is assumed to be shared across all individuals, it cancels out when calculating the partial likelihood, hence eliminating the need for its specification, a key feature of the Cox proportional hazards model rendering it semi-parametric.

To find the optimal parameters $\theta = \{\beta, P\}$, instead of maximizing the partial likelihood, one can equivalently minimize the negative log-likelihood to obtain a more convenient formulation. Taking the logarithm of the partial likelihood function yields a log-likelihood function of the form:

$$l(\theta \mid D) = \log\left( \prod_{i:\delta_i=1} \frac{\exp(f(x_i))}{\sum_{j \in R(t_i)} \exp(f(x_j))} \right) = \sum_{i:\delta_i=1} \left( f(x_i) - \log\left( \sum_{j \in R(t_i)} \exp(f(x_j)) \right) \right) \tag{6}$$

To overcome overfitting in scenarios involving many predictor variables, one can include regularization terms. Here, we consider L2 regularization (Ridge). Hence, the regularized learning problem is given by:

$$\arg\min_{\beta, P} \; -\frac{2}{n}\, l(\theta \mid D) + \lambda_1 \lVert \beta \rVert_2^2 + \lambda_2 \lVert P \rVert_2^2 \tag{7}$$

where $\lambda_1$ and $\lambda_2$ are the regularization parameters for the individual effects and the factorized interactions, respectively. Using separate regularization parameters for the individual effects and the interactions allows for individual penalization of these two parts.
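The pieces in eqs. (4) to (7) can be put together in a few lines. The following is a minimal standard-library sketch, not the implementation in the accompanying R package: the function names are hypothetical, the interaction sum runs over ordered pairs $i \ne j$ as written in eq. (4) (counting each unordered pair once would halve that term), the norm of $P$ is taken as the sum of squared entries, and tied event times are handled by simply admitting everyone with an observed time at least $t_i$ into the risk set.

```python
import math


def interaction_term(x, P):
    """Sum over ordered pairs i != j of <p_i, p_j> x_i x_j in O(d * k).

    Uses the identity sum_{i != j} a_i a_j = (sum_i a_i)^2 - sum_i a_i^2,
    applied per latent dimension f with a_i = P[i][f] * x[i].
    """
    total = 0.0
    for f in range(len(P[0])):
        a = [P[i][f] * x[i] for i in range(len(x))]
        total += sum(a) ** 2 - sum(v * v for v in a)
    return total


def log_risk(x, beta, P):
    """f(x): linear effects plus factorized pairwise interactions, eq. (4)."""
    return sum(b * xi for b, xi in zip(beta, x)) + interaction_term(x, P)


def penalized_objective(X, times, events, beta, P, lam1, lam2):
    """-(2/n) * log partial likelihood plus ridge penalties, eq. (7)."""
    n = len(X)
    f = [log_risk(x, beta, P) for x in X]
    loglik = 0.0
    for i in range(n):
        if events[i]:
            # Risk set R(t_i): individuals whose observed time is >= t_i.
            denom = sum(math.exp(f[j]) for j in range(n) if times[j] >= times[i])
            loglik += f[i] - math.log(denom)
    penalty = (lam1 * sum(b * b for b in beta)
               + lam2 * sum(p * p for row in P for p in row))
    return -2.0 / n * loglik + penalty


# Hypothetical toy input: d = 3 predictors, rank k = 2.
x = [1.0, 2.0, -1.0]
P = [[0.5, 1.0], [-1.0, 0.25], [2.0, 0.0]]
print(interaction_term(x, P))
```

The design point is that the interaction term never materializes the $d \times d$ weight matrix: for roughly 3 000 predictors and a rank of 10, $P$ holds 30 000 entries, whereas there are 4 498 500 distinct unordered pairs.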
The log-likelihood is scaled by a factor of 2/n for convenience and to follow the definition from the popular glmnet R package [35], used for the standard Cox regression comparison in this study. The gradient of the log-likelihood function $l(\theta \mid D)$, whose negative enters the minimization problem (7), with respect to the model parameters $\theta = \{\beta, P\}$ is given by:

$$\frac{\partial}{\partial \theta} l(\theta) = \frac{\partial}{\partial \theta} \sum_{i:\delta_i=1} \left( f(x_i \mid \theta) - \log\left( \sum_{j \in R(t_i)} \exp(f(x_j \mid \theta)) \right) \right) \tag{8}$$

$$= \sum_{i:\delta_i=1} \left( \frac{\partial f(x_i \mid \theta)}{\partial \theta} - \frac{\sum_{j \in R(t_i)} \frac{\partial f(x_j \mid \theta)}{\partial \theta} \exp(f(x_j \mid \theta))}{\sum_{j \in R(t_i)} \exp(f(x_j \mid \theta))} \right) \tag{9}$$

where

$$\frac{\partial f(x \mid \theta)}{\partial \theta} = \begin{cases} x_i & \text{if } \theta \text{ is } \beta_i \\ x_i \sum_{j=1}^{d} p_{j,f}\, x_j - p_{i,f}\, x_i^2 & \text{if } \theta \text{ is } p_{i,f} \end{cases} \tag{10}$$

The sum $\sum_{j=1}^{d} p_{j,f}\, x_j$ is independent of $i$ and thus can be precomputed [11]. In addition, the gradients of the L2 regularization terms are given by $\frac{\partial}{\partial \theta} \lambda \lVert \theta \rVert_2^2 = 2 \lambda \theta$.

To solve (7), we use an efficient BFGS (Broyden–Fletcher–Goldfarb–Shanno) quasi-Newton algorithm [36, 37, 38, 39], as implemented in the base R stats package [40]. In contrast to the standard Newton-Raphson method, the BFGS algorithm uses an approximation of the Hessian to determine the search direction. Due to the factorization of the interaction parameters, the number of estimated parameters remains moderate even in the presence of many predictor variables, making the computation of the Hessian approximation feasible. Empirical evidence from our analyses indicated that alternative stochastic gradient descent (SGD)-based optimization methods, commonly employed in machine learning, were not as effective here.

Study population

The UK Biobank is a comprehensive prospective cohort study serving as a major globally available health research resource. It includes data from approximately half a million participants aged 37-73, representing a sample from the general UK population. The participants were recruited through 22 assessment centers throughout England, Wales, and Scotland between 2006 and 2010. The follow-up is still ongoing.
Further details of the study protocol and data collection are available online (https://www.ukbiobank.ac.uk/media/gnkeyh2q/study-rationale.pdf) and in the literature [41]. The UK Biobank study was approved by the North West Multi-Centre Research Ethics Committee and all participants provided written informed consent. In this study, the data was accessed under UK Biobank project ID 147811.

Predictor variable sets

Standard risk factors

As standard risk factors, we included predictors from the ASCVD risk estimator plus [12, 42], which are also commonly featured in other primary prevention tools. These demographic and cardiovascular risk factors have been shown to be predictive of diseases beyond CVD [13, 14, 15]. These included age, sex, ethnic background, systolic and diastolic blood pressure, total, HDL, and LDL cholesterol, smoking status, prevalent type 2 diabetes (excluded from the analyses related to type 2 diabetes), hypertension, and cholesterol-lowering treatment, further detailed in Supplementary Table 8. This data was extracted from the data collected at the study’s initial recruitment visit. Prevalent diabetes status was extracted from primary care records, hospital episode statistics, and self-reported conditions during the initial assessment. These standard risk factors were included in all models trained.

Clinical biochemistry and blood counts

A comprehensive set of clinical biochemistry measures was provided by UK Biobank for blood samples taken at the initial recruitment visit and has been previously described in the literature [43, 44]. These included hematologic markers (complete blood counts, white blood cell populations and reticulocytes) and a wide range of blood biochemistry measures covering established risk factors, diagnostic biomarkers and other characterisation of phenotypes, such as measures for renal and liver function.
Nucleated blood cell counts were excluded from our analyses due to over 99% of the cohort having these recorded as missing or zero. We also excluded estradiol, rheumatoid factor and lipoprotein (a), due to a large portion of the cohort (>20%) having these recorded as missing or under the limit of detection. The blood sample handling and storage protocol has been previously described in the literature [45]. A complete list of the included variables is provided in Supplementary Table 9.

Metabolomics biomarkers

The metabolomics data included 168 lipids and metabolites from a high-throughput NMR metabolomics assay, available for the baseline blood samples from approximately 275,000 individuals in the UK Biobank. The metabolite data covers a wide range of small molecules, such as amino acids, inflammation markers and ketones, as well as lipids, lipoproteins and fatty acids. Percentage ratios calculated from these 168 original measures were excluded from our analyses. Details of the metabolite data have been previously described [16]. A complete list of the included metabolomics biomarkers is included in Supplementary Table 10.

Polygenic risk scores

The polygenic risk score data included 53 polygenic risk scores (PRS) released by the UK Biobank and described in [17]. These included scores for both disease traits and quantitative traits. In our analyses, we included only the standard PRS set obtained entirely from external genome-wide association study (GWAS) data. As provided in the UK Biobank, the score distributions were already centered at zero across all ancestries using a principal component-based ancestry centering step. A complete list of the included variables is provided in Supplementary Table 11.

QRISK3 risk factors

We matched the risk factors from the QRISK3 model with the corresponding variables available in the UK Biobank.
These variables were gathered during the baseline assessment visit and included cholesterol levels measured from blood samples and prevalent disease diagnoses obtained from linked hospital records, primary care data, and self-reported conditions. In instances where an exact match for a QRISK3 model risk factor was unavailable in the UK Biobank, the closest equivalent field was utilized. A complete list of the predictors and their corresponding UK Biobank fields is provided in Supplementary Table 12.

Disease endpoint definitions

For the highlighted examples across different data modalities and 10-year incidence of ten different disease outcomes, each of the outcomes was defined by the earliest occurrence in primary care, hospital episode statistics or death records, using the first occurrences data field from UK Biobank (category 1712). For lung cancer, we additionally included data from the cancer registry. The endpoints were defined based on 3-character ICD-10 codes, detailed in Supplementary Table 2. Participants with a previous diagnosis of the disease under study were excluded from the analysis of each endpoint.

The analysis of QRISK3 predictors focused on a 10-year composite CVD outcome, defined according to the original QRISK3 derivation study [5], including coronary heart disease, ischemic stroke, and transient ischemic attack. The ICD-10 codes used are detailed in Supplementary Table 5. We used the earliest recorded date of cardiovascular disease in any of the three data sources (primary care, hospital episode statistics and death records) as the outcome date, using the first occurrences data field from UK Biobank (category 1712). Participants with a prior CVD diagnosis and those on cholesterol-lowering medication at the start of the study were excluded from the analyses, following the exclusion criteria from the original QRISK3 derivation study [5].
Data partitions and preprocessing

Model training and testing were performed using a 10-fold cross-validation approach. In each cross-validation cycle, one of the 10 folds at a time was set aside as a test set, aggregating the remaining partitions to form a training set. From this training set, we randomly selected 20% to use as the validation set. Within each of the 10 cross-validation loops, the individual test set remained untouched throughout model development and the validation set was used for model hyperparameter selection. After selecting the optimal hyperparameters, the validation and training sets were combined to train the final model. All 10 obtained models were then evaluated on their respective test sets, and the results from the test sets were aggregated for the final evaluation.

For data preprocessing, log-normally distributed continuous variables (concerning clinical biochemistry markers, blood counts and metabolomics biomarkers) were log1p-transformed (i.e. taking the logarithm of the given value plus one). Outliers exceeding 4 standard deviations from the mean were winsorized. Continuous variables were scaled to zero mean and unit variance and categorical variables were one-hot encoded. The means and standard deviations used for scaling were calculated from the training set and subsequently applied to the validation and test sets. To maximize sample size for the model training, missing values were imputed within the training set using k-nearest neighbors (kNN) imputation (k = 10). To ensure no data leakage between the training, validation and test sets, imputation was exclusively performed within the training set, sparing the validation and test sets from the imputations. Hence, for the validation and test sets, only individuals with complete data available were included.
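The partitioning and scaling steps above can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the function names and seeds are hypothetical, the order of operations (log1p, then winsorizing, then scaling) and the use of training-set statistics for the winsorizing bounds are assumptions, and kNN imputation and one-hot encoding are left out.

```python
import math
import random
import statistics


def nested_splits(n, n_folds=10, val_frac=0.2, seed=0):
    """Yield (train, validation, test) index lists for each CV cycle."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        rest = [i for g in range(n_folds) if g != f for i in folds[g]]
        rng.shuffle(rest)
        n_val = int(round(val_frac * len(rest)))
        yield rest[n_val:], rest[:n_val], test


def fit_preprocessor(train_values, n_sd=4.0):
    """Learn winsorizing bounds and scaling statistics on training data only."""
    logged = [math.log1p(v) for v in train_values]
    mean, sd = statistics.mean(logged), statistics.stdev(logged)
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    clipped = [min(max(v, lo), hi) for v in logged]
    return {"lo": lo, "hi": hi,
            "mean": statistics.mean(clipped), "sd": statistics.stdev(clipped)}


def apply_preprocessor(values, params):
    """Apply the training-set transformation to any split (no refitting)."""
    out = []
    for v in values:
        z = min(max(math.log1p(v), params["lo"]), params["hi"])
        out.append((z - params["mean"]) / params["sd"])
    return out


# Hypothetical biomarker values for a training split and a test split.
train_values = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
params = fit_preprocessor(train_values)
print(apply_preprocessor(train_values, params))
print(apply_preprocessor([2.5, 1e12], params))  # extreme value is clipped
```

Fitting the statistics once and reusing them for the validation and test splits is what keeps information from those splits out of model development.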
Model hyperparameter tuning

In both the standard linear Cox regression and our proposed survivalFM method, we employed L2 (ridge) regularization to control model complexity and prevent overfitting. This requires tuning the regularization parameter $\lambda$. For survivalFM, we allowed differing regularization strengths for the linear part ($\lambda_1$) and the interaction part ($\lambda_2$), to separately control the influence of main effects and interaction effects. All regularization parameters were optimized by considering a series of equally spaced values on a logarithmic scale between $10^{-4}$ and $1$. In addition, survivalFM requires setting the rank of the factorization ($k$) for the interaction parameters, which was here set to $k = 10$.

Analysis of model performance

The standard linear Cox regression models used for the comparisons were trained using the glmnet [46, 35] R package. Concordance indices (Harrell's C-index) were computed using the R package survival [47] and net reclassification improvements using the R package nricens [48]. Confidence intervals for all metrics were calculated with 1000 bootstrapping iterations. Statistical inferences about differences were based on the distributions of the bootstrapped performance differences: performances were considered statistically significantly different when the 95% confidence interval of the difference did not include zero. All analyses were performed using R version 4.3.1 [40].

Code availability

The method developed in this study has been made available as an R package and can be installed from: https://github.com/aalto-ics-kepaco/survivalfm.

Data availability

UK Biobank data are available to researchers upon application at https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access.

Author contributions

H.J. conceived the idea and designed the method with input from J.R. H.J. processed the data, wrote the code, performed the analyses, prepared the figures and wrote the manuscript. J.R. supervised the work and contributed to writing the manuscript.
Both authors read and approved the final manuscript.

Acknowledgments

This work was supported by the Research Council of Finland grants 339421 (Machine Learning for Systems Pharmacology, MASF, 2021-2025) and 345802 (AI technologies for interaction prediction in biomedicine, AIB, 2022-2024). The authors acknowledge the computational resources provided by the Aalto Science-IT project.

References

[1] Cox, D. R. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34, 187–202 (1972).
[2] Harrell Jr, F. E., Lee, K. L. & Mark, D. B. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 15, 361–387 (1996).
[3] Corraini, P., Olsen, M., Pedersen, L., Dekkers, O. M. & Vandenbroucke, J. P. Effect modification, interaction and mediation: an overview of theoretical insights for clinical investigators. Clinical Epidemiology 331–338 (2017).
[4] Prospective Studies Collaboration. Age-specific relevance of usual blood pressure to vascular mortality: a meta-analysis of individual data for one million adults in 61 prospective studies. The Lancet 360, 1903–1913 (2002).
[5] Hippisley-Cox, J., Coupland, C. & Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ 357 (2017).
[6] SCORE2 working group and ESC Cardiovascular risk collaboration. SCORE2 risk prediction algorithms: new models to estimate 10-year risk of cardiovascular disease in Europe. European Heart Journal 42, 2439–2454 (2021).
[7] Kaptoge, S. et al. World Health Organization cardiovascular disease risk charts: revised models to estimate risk in 21 global regions. The Lancet Global Health 7, e1332–e1345 (2019).
[8] Ishwaran, H., Kogalur, U. B., Blackstone, E. H. & Lauer, M. S. Random survival forests. The Annals of Applied Statistics 2, 841–860 (2008).
[9] Katzman, J. L. et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Medical Research Methodology 18, 1–12 (2018).
[10] Nagpal, C., Li, X. & Dubrawski, A. Deep survival machines: Fully parametric survival regression and representation learning for censored data with competing risks. IEEE Journal of Biomedical and Health Informatics 25, 3163–3175 (2021).
[11] Rendle, S. Factorization machines. In 2010 IEEE International Conference on Data Mining, 995–1000 (IEEE, 2010).
[12] American College of Cardiology. ASCVD Risk Predictor Plus https://tools.acc.org/ASCVD-Risk-Estimator-Plus. Date accessed: 2024-04-30. (2020).
[13] Buergel, T. et al. Metabolomic profiles predict individual multidisease outcomes. Nature Medicine 28, 2309–2320 (2022).
[14] de Bruijn, R. F. & Ikram, M. A. Cardiovascular risk factors and future risk of Alzheimer's disease. BMC Medicine 12, 1–9 (2014).
[15] Koene, R. J., Prizment, A. E., Blaes, A. & Konety, S. H. Shared risk factors in cardiovascular disease and cancer. Circulation 133, 1104–1114 (2016).
[16] Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nature Communications 14, 604 (2023).
[17] Thompson, D. J. et al. UK Biobank release and systematic evaluation of optimised polygenic risk scores for 53 diseases and quantitative traits. MedRxiv 2022–06 (2022).
[18] Elliott, J. et al. Predictive accuracy of a polygenic risk score–enhanced prediction model vs a clinical risk score for coronary artery disease. JAMA 323, 636–645 (2020).
[19] Mars, N. et al. Polygenic and clinical risk scores and their impact on age at onset and prediction of cardiometabolic diseases and common cancers. Nature Medicine 26, 549–557 (2020).
[20] Inouye, M. et al. Genomic risk prediction of coronary artery disease in 480,000 adults: implications for primary prevention. Journal of the American College of Cardiology 72, 1883–1893 (2018).
[21] Pencina, M. J., D'Agostino Sr, R. B., D'Agostino Jr, R. B. & Vasan, R. S. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in Medicine 27, 157–172 (2008).
[22] Pencina, M. J., D'Agostino Sr, R. B. & Steyerberg, E. W. Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Statistics in Medicine 30, 11–21 (2011).
[23] Altamirano, J. & Bataller, R. Cigarette smoking and chronic liver diseases. Gut 59, 1159–1162 (2010).
[24] Sato, S. et al. Elevated serum tyrosine concentration is associated with a poor prognosis among patients with liver cirrhosis. Hepatology Research 51, 786–795 (2021).
[25] Morgan, M. Y., Marshall, A., Milsom, J. P. & Sherlock, S. Plasma amino-acid patterns in liver disease. Gut 23, 362–370 (1982).
[26] Sunami, Y. NASH, fibrosis and hepatocellular carcinoma: Lipid synthesis and glutamine/acetate signaling. International Journal of Molecular Sciences 21, 6799 (2020).
[27] Zakhari, S. Overview: how is alcohol metabolized by the body? Alcohol Research & Health 29, 245 (2006).
[28] Parsons, R. E. et al. Independent external validation of the QRISK3 cardiovascular disease risk prediction model using UK Biobank. Heart 109, 1690–1697 (2023).
[29] National Institute for Health and Care Excellence. Cardiovascular disease: risk assessment and reduction, including lipid modification (NICE guideline [NG238]) (2023). https://www.nice.org.uk/guidance/ng238/chapter/Recommendationsstatins-for-primaryprevention-of-cardiovascular-disease. Date accessed: 2024-04-30.
[30] Odutayo, A. et al. Atrial fibrillation and risks of cardiovascular disease, renal disease, and death: systematic review and meta-analysis. BMJ 354 (2016).
[31] World Health Organization. WHO reveals leading causes of death and disability worldwide: 2000-2019. World Health Organization (WHO) 1 (2020).
[32] Gadd, D. A. et al. Blood protein assessment of leading incident diseases and mortality in the UK Biobank. Nature Aging 1–10 (2024).
[33] You, J. et al. Plasma proteomic profiles predict individual future health risk. Nature Communications 14, 7817 (2023).
[34] Sun, B. B. et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 622, 329–338 (2023).
[35] Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software 39, 1 (2011).
[36] Broyden, C. G. The convergence of a class of double-rank minimization algorithms 1. general considerations. IMA Journal of Applied Mathematics 6, 76–90 (1970).
[37] Fletcher, R. A new approach to variable metric algorithms. The Computer Journal 13, 317–322 (1970).
[38] Goldfarb, D. A family of variable metric updates derived by variational means, v. 24. Mathematics of Computation 21–55 (1970).
[39] Shanno, D. F. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation 24, 647–656 (1970).
[40] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2023).
[41] Fry, A. et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology 186, 1026–1034 (2017).
[42] Arnett, D. K. et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: a report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. Circulation 140, e596–e646 (2019).
[43] Allen, N. E. et al. Approaches to minimising the epidemiological impact of sources of systematic and random variation that may affect biochemistry assay data in UK Biobank. Wellcome Open Research 5 (2020).
[44] Watts, E. L. et al. Hematologic markers and prostate cancer risk: a prospective analysis in UK Biobank. Cancer Epidemiology, Biomarkers & Prevention 29, 1615–1626 (2020).
[45] Elliott, P. & Peakman, T. C. The UK Biobank sample handling and storage protocol for the collection, processing and archiving of human blood and urine. International Journal of Epidemiology 37, 234–244 (2008).
[46] Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1 (2010).
[47] Therneau, T. M. A package for survival analysis in R (2024). R package version 3.7-0.
[48] Inoue, E. nricens: NRI for risk prediction models with time to event and binary response data. (2018). R package version 1.6.

[Figure 1 panels. a) Comprehensive interaction modelling by machine learning: survivalFM (linear effects; all potential interaction effects; factorized parametrization with a low-rank parameter matrix of the factor vectors, $P \in \mathbb{R}^{d \times k}$). b) Method evaluation: standard Cox regression (linear terms only) versus survivalFM (linear terms plus factorized pairwise interaction terms $\langle p_i, p_j \rangle x_i x_j$). c) Disease prediction examples in the UK Biobank: i) case studies with various data modalities (10-year onset of 10 different diseases; 4 scenarios of predictors: standard risk factors alone, or together with biochemistry and blood counts, metabolomics biomarkers, or polygenic risk scores); ii) clinical example, QRISK3 (10-year onset of cardiovascular disease; established cardiovascular disease predictors from QRISK3: demographic characteristics, blood pressure, body mass index, prevalent diseases, medications, cholesterol measures).]

Figure 1: Method overview and evaluation examples.
a) A machine learning survival analysis method, survivalFM, is developed to estimate linear and all pairwise interaction effects between predictor variables using a factorized parametrization of the interaction terms, $\beta_{i,j} \approx \langle p_i, p_j \rangle$. Here $d$ denotes the number of predictor variables and $k$ is a hyperparameter defining the rank of the factorization of the interaction terms. The rank of the factorization is typically much lower than the number of predictor variables ($k \ll d$), enabling computation of the interaction terms even in the presence of many input variables. b) The added value of incorporating comprehensive interaction terms using survivalFM is assessed by comparing the performance to the standard linear Cox proportional hazards regression. c) The performance is evaluated in various disease prediction examples: i) case studies with four different predictor sets, each applied to ten disease examples; ii) a clinical example using predictors from the QRISK3 cardiovascular disease (CVD) risk evaluation tool.

Figure 2: Comprehensive interaction modelling by survivalFM improves risk prediction performance across various diseases and data modalities. Comparison of the predictive performance of survivalFM to standard linear Cox proportional hazards regression in terms of the difference in concordance index (Δ C-index) and continuous net reclassification improvement (NRI). Results are shown for ten disease examples (y-axis) across four data modalities: a) standard risk factors (blue; included in all models), b) clinical biochemistry and blood counts (red), c) metabolomics biomarkers (orange) and d) polygenic risk scores (green). Horizontal error bars denote 95% confidence intervals (CIs), estimated with bootstrapping over 1000 resamples. Sample sizes and event counts for each disease example are provided in Supplementary Table 3.
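The bootstrapped confidence intervals for the difference in C-index shown in Figure 2 can be illustrated with a minimal Python sketch. This is not the authors' implementation (the study computed the C-index with the R package survival); the function names are hypothetical, tied event times are simply treated as non-comparable, and the percentile interval is an approximate choice made for illustration.

```python
import random


def c_index(times, events, risks):
    """Harrell's C-index, or None if the sample has no comparable pairs.

    A pair (i, j) is comparable when i had an event and a shorter time than j.
    It is concordant when i also has the higher predicted risk; risk ties
    count as half.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else None


def bootstrap_delta_c(times, events, risk_new, risk_ref, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for C(new) - C(ref)."""
    if c_index(times, events, risk_ref) is None:
        raise ValueError("no comparable pairs in the data")
    rng = random.Random(seed)
    n = len(times)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        t = [times[i] for i in idx]
        e = [events[i] for i in idx]
        c_new = c_index(t, e, [risk_new[i] for i in idx])
        c_ref = c_index(t, e, [risk_ref[i] for i in idx])
        if c_new is None:
            continue  # degenerate resample without comparable pairs
        deltas.append(c_new - c_ref)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


# Toy data: the "new" model ranks everyone correctly, the reference is uninformative.
times = list(range(1, 31))
events = [1] * 30
risk_new = [-t for t in times]
risk_ref = [0.0] * 30
lower, upper = bootstrap_delta_c(times, events, risk_new, risk_ref, n_boot=200)
significant = lower > 0 or upper < 0  # interval excludes zero
```

Both models are evaluated on the same resampled individuals in each iteration, so the interval describes the paired difference rather than two independent estimates.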
[Figure 3 panels: C-index (y-axis) against the number of individuals in training (x-axis, 0 to 300,000) for survivalFM and standard Cox regression, one panel per disease: alcoholic liver disease, Alzheimer's disease, chronic kidney disease, liver fibrosis and cirrhosis, lung cancer, myocardial infarction, osteoporosis, stroke, type 2 diabetes, vascular and other dementia.]

Figure 3: Comprehensive interaction modelling using survivalFM benefits from large training data sizes. Impact of the size of the training dataset (x-axis) on the discrimination performance as measured by concordance index (C-index; y-axis), comparing survivalFM (blue) to standard Cox regression (gray). Results are shown for the input dataset consisting of standard risk factors. Sample sizes and event counts for each example are provided in Supplementary Table 3.

Figure 4: Evaluation of survivalFM in a practical clinical cardiovascular risk prediction scenario involving predictors from QRISK3. Performance of the models trained considering QRISK3 predictors for composite cardiovascular disease prediction (N = 344 292 with complete data, 21 534 events). a) Discrimination performance evaluated using concordance index (C-index) for three models: standard Cox regression with linear terms, standard Cox regression with linear terms + age interaction terms from QRISK3, and survivalFM model with linear terms + all factorized pairwise interactions. b) Categorical net reclassification improvements (NRI) at the 10% absolute risk threshold, as compared to the standard Cox model with linear terms.
Horizontal error bars denote 95% confidence intervals (CIs), estimated with bootstrapping over 1000 resamples. c) Reclassification plots showing how the inclusion of interaction terms in the more advanced models (y-axis, logarithmic scale) changes individual risk predictions, as compared to a standard linear Cox model (x-axis, logarithmic scale). Black dotted vertical and horizontal lines show the 10% absolute risk threshold for the high risk category.

Figure 5: Estimated model coefficients from the survivalFM model, trained considering the risk factors from QRISK3 for cardiovascular disease risk prediction. The coefficients are shown as the average of the estimated coefficients across the ten models trained during the cross-validation. Estimated coefficients for a) the linear effects $\beta$ and b) the interaction effects given by the inner product of the factor vectors, $\beta_{i,j} = \langle p_i, p_j \rangle$. The dendrogram shows a hierarchical clustering of the interaction profiles, using Euclidean distance as the measure of similarity.
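To make the factorized interaction terms in the Figure 1 and Figure 5 captions concrete, the following Python sketch computes a linear predictor with interaction coefficients given by inner products of factor vectors. It is an illustration under stated assumptions and not the survivalFM package code: the function names are hypothetical, and the sum is taken over all ordered pairs $i \neq j$, which is an assumed reading of the garbled summation index in Figure 1 (each unordered pair is therefore counted twice). The second function uses the standard factorization-machine identity to avoid forming the $d \times d$ interaction matrix.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_predictor_naive(x, beta, P):
    """beta.x plus the sum over i != j of <p_i, p_j> x_i x_j (O(d^2 k) work)."""
    d = len(x)
    score = inner(beta, x)
    for i in range(d):
        for j in range(d):
            if i != j:
                score += inner(P[i], P[j]) * x[i] * x[j]
    return score


def linear_predictor_fast(x, beta, P):
    """Same quantity in O(d k) work.

    For each factor f:
    sum_{i != j} p_if p_jf x_i x_j = (sum_i p_if x_i)^2 - sum_i (p_if x_i)^2
    """
    k = len(P[0])
    score = inner(beta, x)
    for f in range(k):
        weighted = [P[i][f] * x[i] for i in range(len(x))]
        score += sum(weighted) ** 2 - sum(w * w for w in weighted)
    return score


# Toy example with d = 3 predictors and rank k = 2.
x = [1.0, 2.0, 3.0]
beta = [0.1, 0.2, 0.3]
P = [[1.0, 0.0], [0.5, 1.0], [0.0, 2.0]]
naive = linear_predictor_naive(x, beta, P)
fast = linear_predictor_fast(x, beta, P)
```

With $d$ predictors, the factor matrix holds $d \times k$ parameters instead of the $d \times d$ coefficients of an unconstrained interaction matrix, which is what makes estimating all pairwise interactions tractable when $k \ll d$.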
