SC1A: Network inference (experimental data)

Data split:
• Training data (4 treatments): FGFR1/3i, AKTi, AKTi+MEKi, DMSO
• All data: N treatments
• Test data (N-4 held-out treatments): Test1, Test2, …, Test(N-4)

• Participants infer 32 networks (one per cell line/stimulus regime) using the training data; the inferred networks are then assessed using the test data.
• There are no definitive "gold standard" causal networks, so we use a novel held-out validation approach that emphasizes the causal aspect of the challenge.
• Assessment: how well do the inferred causal networks agree with the effects observed under inhibition in the test data?

Step 1: Identify a "gold standard" of observed effects. For each phosphoprotein and each cell line/stimulus regime, a paired t-test compares the DMSO and test-inhibitor time courses; significant changes are recorded as 1s in a binary gold-standard vector over phosphoproteins.
[Figure: example time courses (a.u.) for UACC812/Serum under Test1 vs. DMSO; Phospho2 changes significantly (p = 3.2 x 10^-5), Phospho1 does not (p = 0.45), contributing to a binary gold-standard vector such as (0, 1, 1, 0, 1, 0, 0, 1, 0, 0).]

Step 2: Score submissions. Each submission provides a matrix of predicted edge scores for each cell line/stimulus regime. For a given threshold τ, edges scoring above τ are retained and the protein descendants downstream of the test inhibitor's target are obtained. Comparing this predicted set of affected proteins against the gold-standard list of effects observed in the held-out data yields true-positive and false-positive counts #TP(τ) and #FP(τ). Varying τ traces out a ROC curve (#TP vs. #FP), summarized by its AUROC score. (A code sketch of this procedure appears after the SC1 summary below.)

• 74 final submissions; each submission receives 32 AUROC scores (one per cell line/stimulus regime).
[Figure: AUROC scores per submission, distinguishing significant from non-significant AUROCs and marking the best performer; annotated p-values include 3.58 x 10^-6, 4.18 x 10^-6, 8.98 x 10^-6 and 9.19 x 10^-4.]

Scoring procedure:
1. For each submission and each cell line/stimulus pair, compute the AUROC score.
2. Collect the scores into a submissions x 32 cell line/stimulus pairs matrix of AUROCs.
3. Rank the submissions for each cell line/stimulus pair.
4. Calculate the mean rank across cell line/stimulus pairs for each submission, and rank submissions according to mean rank to obtain the final ranking.
[Figure: worked example showing an AUROC matrix, per-regime ranks, mean ranks (e.g. 1.33, 3.66) and the resulting final ranking.]

Robustness analysis (verify that the final ranking is robust):
1. Mask 50% of phosphoproteins in each AUROC calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: rank distributions across the 100 repeats for the top 10 teams; best performer annotated with p = 5.40 x 10^-10.]

SC1B: Network inference (in silico data)

• A gold standard is available: the data-generating causal network, whose nodes are the phosphoproteins ER-alpha_pS118, HER2_pY1248, EGFR_pY1173, Src_pY416, PKC-alpha_pS657, Src_pY527, S6_pS235_S236, p38_pT180_Y182, Rb_pS807_S811, C-Raf_pS338, p27_pT198, p90RSK_pT359_S363, MEK1_pS217_S221, JNK_pT183_pT185, GSK3-alpha-beta_pS21_S9, p70S6K_pT389, S6_pS240_S244, MAPK_pT202_Y204, AMPK_pT172, Bad_pS112, Akt_pS473, mTOR_pS2448, STAT3_pY705, PRAS40_pT246, 4E-BP1_pS65, PDK1_pS241, ACC_pS79 and YAP_pS127.
• Participants submitted a single set of edge scores.
• Edge scores are compared against the gold standard to give an AUROC score, and participants are ranked by AUROC.
[Figure: AUROC scores per submission; 14 significant and 51 non-significant AUROCs; best performer marked; annotated p-values include 3.11 x 10^-11 and 3.90 x 10^-14.]

Robustness analysis:
1. Mask 50% of edges in the AUROC calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: rank distributions across the 100 repeats for the top 10 teams.]

SC1 combined ranking:
• 59 teams participated in both SC1A and SC1B.
• To reward consistently good performance across both parts of SC1, the final score is the average of the SC1A and SC1B ranks.
• The top team ranked robustly first.
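The two-step SC1A procedure maps naturally onto a short script. The sketch below is a minimal illustration, not the challenge's official implementation: the significance level of 0.05, the encoding of networks as square edge-score matrices (entry [i, j] scoring the edge i -> j), and all function names are assumptions introduced here.

```python
import numpy as np
from scipy.stats import rankdata, ttest_rel
from sklearn.metrics import roc_auc_score

ALPHA = 0.05  # assumed significance level; not specified in the text above

def gold_standard(dmso, inhib):
    """Step 1: paired t-test per phosphoprotein (DMSO vs. test inhibitor)
    for one cell line/stimulus regime. Inputs have shape
    (n_paired_samples, n_proteins); returns a binary vector with 1 where
    the inhibitor significantly changed the phosphoprotein."""
    _, pvals = ttest_rel(dmso, inhib, axis=0)
    return (pvals < ALPHA).astype(int)

def descendants(edge_scores, target, tau):
    """Proteins reachable downstream of the test inhibitor's target after
    thresholding the predicted edge-score matrix at tau (the target node
    itself is excluded here for simplicity)."""
    adj = edge_scores >= tau          # adj[i, j]: edge i -> j retained
    reached, frontier = set(), {target}
    while frontier:
        node = frontier.pop()
        reached.add(node)
        frontier |= set(np.flatnonzero(adj[node])) - reached
    reached.discard(target)
    return reached

def auroc(edge_scores, target, gold):
    """Step 2: vary tau. Lowering tau only adds edges, so the descendant
    set grows monotonically; scoring each protein by the largest tau at
    which it is a descendant therefore reproduces the ROC curve traced by
    the (#FP(tau), #TP(tau)) points."""
    protein_score = np.full(len(gold), edge_scores.min() - 1.0)
    for tau in np.unique(edge_scores)[::-1]:  # descending thresholds
        for p in descendants(edge_scores, target, tau):
            protein_score[p] = max(protein_score[p], tau)
    return roc_auc_score(gold, protein_score)

def final_ranking(auroc_matrix):
    """Steps 3-4: rank submissions within each regime (rank 1 = highest
    AUROC), take the mean rank across regimes, and rank the mean ranks.
    auroc_matrix has shape (n_submissions, n_regimes)."""
    per_regime = rankdata(-auroc_matrix, axis=0)
    return rankdata(per_regime.mean(axis=1))
```

With a 74 x 32 matrix of AUROC scores, final_ranking yields the final SC1A ranking; the same mean-rank aggregation is reused for SC2 below.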
SC2A: Time-course prediction (experimental data)

Data split (as in SC1A):
• Training data (4 treatments): FGFR1/3i, AKTi, AKTi+MEKi, DMSO
• All data: N treatments
• Test data (N-4 held-out treatments): Test1, Test2, …, Test(N-4)

• Participants build dynamical models using the training data and predict phosphoprotein trajectories under inhibitions not present in the training data; the predictions are assessed using the test data.
• Participants made predictions for all phosphoproteins, for each cell line/stimulus pair, under inhibition with each of 5 test inhibitors.
• Assessment: how well do the predicted trajectories agree with the corresponding trajectories in the test data?
• Scoring metric: root-mean-squared error (RMSE), calculated for each cell line/phosphoprotein/test inhibitor combination (e.g. UACC812, Phospho1, Test1); a code sketch of the metric and the robustness check appears at the end of this section:

RMSE_{p,c,i} = \sqrt{\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left(\hat{x}_{p,c,i,s,t} - x_{p,c,i,s,t}\right)^{2}}

where p indexes phosphoproteins, c cell lines, i test inhibitors, s the S stimuli and t the T time points, and \hat{x} and x denote predicted and measured values respectively.

• 14 final submissions.
• Final ranking: analogously to SC1A, submissions are ranked for each regime and the mean rank is calculated.
[Figure: scores per submission with significance annotations and the best performer marked; annotated p-values include 1.35 x 10^-4, 3.70 x 10^-8, 1.49 x 10^-5 and 1.21 x 10^-6.]

Robustness analysis (verify that the final ranking is robust):
1. Mask 50% of data points in each RMSE calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: rank distributions across the 100 repeats for the top 10 teams; 2 best performers (annotated p-values 3.04 x 10^-18 and 6.97 x 10^-5); one incomplete submission (0.99).]

SC2B: Time-course prediction (in silico data)

• Participants made predictions for all phosphoproteins, for each stimulus regime, under inhibition of each phosphoprotein in turn.
• The scoring metric is again RMSE, and the procedure follows that of SC2A:

RMSE_{p,i} = \sqrt{\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left(\hat{x}_{p,i,s,t} - x_{p,i,s,t}\right)^{2}}

[Figure: scores per submission with significance annotations and the best performer marked; annotated p-values include 0.015, 1.68 x 10^-14, 2.89 x 10^-7 and 7.71 x 10^-19.]

Robustness analysis:
1. Mask 50% of data points in each RMSE calculation.
2. Re-calculate the final ranking.
3. Repeat (1) and (2) 100 times.
[Figure: rank distributions for the top 10 teams; one incomplete submission (0.99).]

SC2 combined ranking:
• 10 teams participated in both SC2A and SC2B.
• To reward consistently good performance across both parts of SC2, the final score is the average of the SC2A and SC2B ranks.
• The top team ranked robustly first.

Community vote:
• 14 submissions.
• 36 HPN-DREAM participants voted, each assigning ranks 1 to 3.
• Final score = mean rank, with unranked submissions assigned rank 4.

Conclusions:
• Submissions were rigorously assessed using held-out test data.
• SC1A: a novel procedure was used to assess network inference performance in a setting with no true "gold standard".
• Many statistically significant predictions were submitted.

For further investigation:
• Explore why some regimes (e.g. cell line/stimulus pairs) are easier to predict than others.
• Determine why different teams performed well in the experimental and in silico challenges.
• Identify the methods/approaches that yield the best predictions.
• Wisdom of crowds: does aggregating submissions improve performance and lead to discovery of biological insights?
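As referenced in the SC2A section, the following is a minimal Python sketch of the SC2 scoring metric and the masking-based robustness analysis. The array shapes, function names and random masking scheme are illustrative assumptions; only the RMSE formula and the mask-50%/re-rank/repeat-100-times loop come from the description above.

```python
import numpy as np
from scipy.stats import rankdata

def rmse(pred, truth):
    """RMSE over whatever data points are supplied (e.g. the S x T grid
    of stimuli and time points for one scoring combination)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return np.sqrt(np.mean((pred - truth) ** 2))

def robust_rankings(preds, truths, n_repeats=100, frac=0.5, seed=0):
    """Mask `frac` of the data points entering each RMSE calculation,
    re-compute the mean-rank final ranking, and repeat `n_repeats` times.

    preds:  (n_submissions, n_regimes, S, T) predicted trajectories
    truths: (n_regimes, S, T) held-out test trajectories
    Returns an (n_repeats, n_submissions) array of final ranks."""
    rng = np.random.default_rng(seed)
    n_sub, n_reg = preds.shape[:2]
    all_ranks = []
    for _ in range(n_repeats):
        keep = rng.random(truths.shape) >= frac      # keep ~50% of points
        scores = np.empty((n_sub, n_reg))
        for i in range(n_sub):
            for r in range(n_reg):
                scores[i, r] = rmse(preds[i, r][keep[r]], truths[r][keep[r]])
        per_regime = rankdata(scores, axis=0)        # lower RMSE = rank 1
        all_ranks.append(rankdata(per_regime.mean(axis=1)))
    return np.array(all_ranks)
```

The spread of each submission's rank across the 100 repeats is what the rank-distribution figures for the top 10 teams summarize: a submission whose rank barely moves under masking is robustly ranked.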