Computational Drug Repositioning by Machine Learning from EHR Data David Page Center for Predictive Computational Phenotyping University of Wisconsin-Madison EHR Data in a Relational Data Warehouse Patient ID P1 Gender M Birthdate 3/22/1963 Patient ID P1 P1 Date 1/1/2001 2/1/2001 Physician Smith Jones Symptoms palpitations fever, aches Patient ID P1 P1 Date 1/1/2001 1/9/2001 Lab Test blood glucose blood glucose Result 42 45 Patient ID P1 P2 Date 1/1/2001 1/9/2001 Observation Height BMI Result 5'11 34.5 Patient ID P1 Date Prescribed 5/17/1998 Date Filled 5/18/1998 Physician Jones Diagnosis hypoglycemic influenza Medication Prilosec Dose 10mg Duration 3 months Alternative View of Patient Data: Irregularly-Sampled Time Series But Most ML Algorithms Expect: • Single Table (Spreadsheet), or • Regularly-Sampled Time Series • Another Challenge: Most ML algorithms aim at accurate prediction, not causal discovery, and many potential confounders are unmeasured, and possibly not even considered High-Throughput ML (Kleiman, Bennett, et al., 2016) Predicting Every ICD Diagnosis Code at the Press of a Button Extending SCCS to Numerical Response (Kuang et al.) Electronic Health Records (EHRs) Properties Person 1 Longitudinal 200 160 150 Diabetes Drug 1 Observational 120 Drug 2 30 60 Age Adverse Drug Person 2 Reaction (ADR) discovery Bleeding 90 104 Applications 100 Drug 2 Age Computational Drug Repositioning (CDR) A Critical Intuition: Underlying Baseline Person 1 Diabetes Drug 2 30 60 Age Person 2 Bleeding 30 60 Age Baseline: Blood sugar level under no influence of any drugs. Fixed Effect Model Person 1 Diabetes Insulin 30 Insulin 45 60 Age Fixed Effect Model (Frees, 2004): yij | xij dimβ =# drugs ⊤ = αi + β xij + ϵij, ϵij ∼ N(0,σ2). Time-Varying Baseline Person 1 Diabetes Insulin 30 Insulin 45 60 Age Time-Varying Baseline, add regularization to minimize change in proximal, consecutive tij values: ⊤ yij | xij = tij + β xij + ϵij, ϵij ∼ N(0,σ2). Time-Varying Baseline Person 1 Diabetes Insulin 30 Insulin 45 60 Age Time-Varying Baseline: add regularization to minimize change in proximal, consecutive tij values: ⊤ yij | xij = tij + β xij + ϵij, ϵij ∼ N(0,σ2). Last term penalizes consecutive baseline values that are close together in time but far apart in numerical value. Ground Truth for Drugs that are Glucose Lowering 1.00 1.00 Methods ABR CSCCSA 0.75 BR CSCCS 0.75 0.50 0.25 0.00 0.50 Methods ABR CSCCSA BR CSCCS 8 0.25 16 24 K 32 40 0.00 0.25 0.5 0.75 1 FPR Threshold Figure: Left: Precision at K among the top-forty drugs generated by the four models; Right: Partial AUCs on the top-forty drugs generated by the four models. Sample size: 219306. Number of drug candidates: 2980. Recovery of Known Glucose Lowering Agents INDX CODE 1 4485 2 7470 3 8437 4 4837 5 6382 6 4171 7 4106 8 160 9 824 10 9152 11 4132 12 4184 13 4170 14 4208 15 416 16 4107 17 844 18 2830 19 4806 20 5787 21 2824 22 5786 23 7731 24 1760 25 4497 26 9889 27 4813 28 4133 29 6445 30 6656 31 9379 32 1636 33 1218 34 8025 35 8316 36 9136 37 4802 38 7674 39 4804 40 1200 DRUG NAME HUMALOG PIOGLITAZONE HCL ROSIGLITAZONE MALEATE INSULN ASP PRT/INSULIN ASPART NEEDLES INSULIN DISPOSABLE GLUCOTROL XL GLIMEPIRIDE ACTOS AVANDIA SYRING W-NDL DISP INSUL 0.5ML GLUCOPHAGE GLYBURIDE GLUCOTROL GLYNASE AMARYL GLIPIZIDE AXID DILTIAZEM INSULIN GLARGINE HUM.REC.ANLOG METFORMIN HCL DILAUDID METFORMIN PRAVACHOL CELEXA HUM INSULIN NPH/REG INSULIN HM URSODIOL INSULIN NPL/INSULIN LISPRO GLUCOPHAGE XR NEURONTIN NPH HUMAN INSULIN ISOPHANE THIAMINE HCL CARDURA BLOOD SUGAR DIAGNOSTIC DRUM PROZAC REZULIN SYRINGE & NEEDLE INSULIN 1 ML INSULIN POTASSIUM CHLORIDE INSULIN ASPART BLOOD-GLUCOSE METER SCORE COUNT -11.786 124 -10.220 3075 -9.731 1019 -9.658 258 -9.464 2827 -8.117 2853 -7.940 3384 -7.721 1125 -6.802 1239 -6.623 4186 -6.322 6736 -6.021 8879 -5.721 1259 -5.670 591 -5.599 2240 -5.563 9993 -4.682 189 -4.297 1021 -4.175 4213 -4.147 19584 -4.076 39 -3.890 3838 -3.532 1700 -3.517 1473 -3.501 1829 -3.132 376 -2.972 623 -2.845 765 -2.615 1418 -2.500 2874 -2.383 341 -2.198 1079 -2.073 2593 -2.037 1525 -1.895 444 -1.885 3542 -1.812 1526 -1.779 9842 -1.752 2476 -1.719 5289 INDX CODE 1 4802 2 8316 3 824 4 416 5 5226 6 5789 7 4485 8 4132 9 4811 10 144 11 1389 12 4171 13 9155 14 4116 15 6652 16 160 17 6646 18 4813 19 8437 20 4170 21 9889 22 5052 23 4118 24 2521 25 7470 26 5786 27 10366 28 4500 29 4172 30 7471 31 6382 32 4184 33 4208 34 4210 35 4163 36 5977 37 7946 38 1602 39 1305 40 6657 DRUG NAME INSULIN REZULIN AVANDIA AMARYL LANTUS METFORMIN HYDROCHLORIDE HUMALOG GLUCOPHAGE INSULIN NPH ACTIGALL CAL GLUCOTROL XL SYRNG W-NDL DISP INSUL 0.333ML GLUCAGON NOVOLOG ACTOS NOVOFINE 31 INSULIN NPL/INSULIN LISPRO ROSIGLITAZONE MALEATE GLUCOTROL URSODIOL KAY CIEL GLUCAGON HUMAN RECOMBINANT DARVOCET-N PIOGLITAZONE HCL METFORMIN ZINC SULFATE HUMULIN GLUCOVANCE PIOGLITAZONE HCL/METFORMIN HCL NEEDLES INSULIN DISPOSABLE GLYBURIDE GLYNASE GLYSET GLUCOSE MINIPRESS PROPANTHELINE CARBOCAINE BUDEPRION SR NPH INSULIN SCORE COUNT 47 635 50 120 59 449 65 503 66 33 75 10 81 63 86 1813 88 19 90 34 90 45 90 701 95 29 97 121 98 51 106 480 106 31 108 118 109 332 113 641 113 123 114 23 115 227 116 11 121 705 125 2149 130 34 135 33 136 115 136 16 137 649 144 1354 145 115 148 7 159 1778 163 19 163 6 182 63 185 9 185 43 Conclusion • Impossible to perfectly infer causation from purely observational data such as EHR • But real applications are often about causation • Empirical results suggest can do this well for some drugs and some frequent lab measurements • Many other potential applications – Treatment response and adverse drug events – Other biological tasks – Other causal discovery tasks Thanks! • • • • • • • NIH BD2K Program NLM, NIGMS, NIAID Charles Kuang Peggy Peissig Michael Caldwell James Thomson Ron Stewart