Risk modeling of colorectal cancer using machine learning algorithms on a hybridized genealogical and clinical dataset

David P. Taylor, MS 1,2; Lisa A. Cannon-Albright, PhD 1; Randall W. Burt, MD 1,3; Jason P. Jones, PhD 1,2; Carol Sweeney, PhD 1; Marc S. Williams, MD 1,2; Peter J. Haug, MD 1,2

1 University of Utah, Salt Lake City, UT; 2 Intermountain Healthcare, Salt Lake City, UT; 3 Huntsman Cancer Institute, Salt Lake City, UT

Contact Information
David Taylor, d.p.taylor@utah.edu

Introduction

Although colorectal cancer (CRC) is the third most common cancer diagnosed in the United States and the second leading cause of death among cancers, a majority of US adults are not being screened regularly or appropriately. Knowledge of increased risk can be a motivating factor in deciding to be screened. Family history is a well-established risk factor for CRC, and the incorporation of family history in electronic health records (EHRs) in a structured format is gaining momentum [1].

A comprehensive CRC risk model that includes a variety of risk/protective factors has been developed [2]. However, its data were generally self-reported, and family history of CRC was limited to affected first-degree relatives (FDRs). In an ongoing research project, we have investigated risk from combinations of affected first-, second-, and third-degree relatives (FDRs, SDRs, and TDRs, respectively) in more than 2.3 million probands in the Utah Population Database (UPDB), a population-based, electronic, genealogical resource that also contains cancer registry records [3]. Many individuals in the UPDB also have clinical records in the Intermountain Healthcare system. Clinical data for particular CRC risk factors are available electronically starting in 1993; however, availability is sporadic until the 2000-2002 time frame.

Our goal is to build a risk model for CRC that includes objective clinical and family history data and, in particular, to compare the family history component of risk with the component from available clinical/behavioral factors. An important outcome of this work will be to determine whether typically available electronic clinical and administrative data can improve the predictive power of a risk model beyond that obtained using family history alone and, if so, how the model may be generalized to other settings.

Influence of Family History on CRC Risk

In an analysis of 2,327,327 individuals included in ≥3 generation family histories, 10,556 had a diagnosis of CRC. The number of affected FDRs influences risk much more than the number of affected SDRs or TDRs. However, when combined with a positive first-degree family history, a positive second- and third-degree family history can significantly increase risk. Age at diagnosis of CRC in affected relatives contributes significantly to risk estimates. Even a diagnosis between 60 and 69 years of age in an affected FDR raises risk to a level equivalent to that of an affected FDR without respect to age at diagnosis.

Affected FDRs                    Affected SDRs  Affected TDRs  N (probands)  FRR   Lower CI (95%)  Upper CI (95%)
0                                0              0              1,470,367     0.83  0.81            0.86
0                                1              2              20,321        1.33  1.13            1.55
1                                NA             NA             87,089        1.91  1.82            2.00
≥1                               NA             NA             94,931        2.05  1.96            2.14
≥1 (dx age <50)                  NA             NA             6,291         3.31  2.79            3.89
≥1 (dx age 60-69)                NA             NA             25,084        2.22  2.04            2.40
1                                1              0              8,836         1.88  1.59            2.20
1                                1              ≥3             1,357         3.28  2.44            4.31
2                                NA             NA             6,966         3.01  2.66            3.38
Mother and father both affected  NA             NA             450           4.97  2.72            8.34
3                                NA             NA             762           4.43  3.24            5.90

Table 1: Selected familial relative risks (FRR) for constellations of affected FDRs, SDRs, and TDRs.
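As a reading aid for Table 1, the following minimal sketch compares two constellations by the ratio of their FRRs and by whether their 95% confidence intervals overlap. The FRR values are copied from Table 1; the names `FRR`, `TABLE1`, `intervals_overlap`, and `compare` are hypothetical helpers introduced only for illustration. The interval-overlap check is a rough heuristic, not a formal significance test, and is not claimed to be the method used in the analysis.

```python
from typing import NamedTuple


class FRR(NamedTuple):
    n: int       # number of probands
    frr: float   # familial relative risk
    lo: float    # lower bound of 95% CI
    hi: float    # upper bound of 95% CI


# Selected rows of Table 1, keyed by (FDRs, SDRs, TDRs) labels as printed
# (">=3" stands for the table's "≥3"; "NA" means not considered).
TABLE1 = {
    ("0", "0", "0"): FRR(1_470_367, 0.83, 0.81, 0.86),
    ("0", "1", "2"): FRR(20_321, 1.33, 1.13, 1.55),
    ("1", "NA", "NA"): FRR(87_089, 1.91, 1.82, 2.00),
    ("1", "1", "0"): FRR(8_836, 1.88, 1.59, 2.20),
    ("1", "1", ">=3"): FRR(1_357, 3.28, 2.44, 4.31),
    ("2", "NA", "NA"): FRR(6_966, 3.01, 2.66, 3.38),
}


def intervals_overlap(a: FRR, b: FRR) -> bool:
    """True if the two 95% confidence intervals share any value."""
    return a.lo <= b.hi and b.lo <= a.hi


def compare(base_key, other_key) -> dict:
    """Ratio of FRRs (other / base) and a CI-overlap flag (heuristic only)."""
    base, other = TABLE1[base_key], TABLE1[other_key]
    return {
        "ratio": round(other.frr / base.frr, 2),
        "overlap": intervals_overlap(base, other),
    }


if __name__ == "__main__":
    one_fdr = ("1", "NA", "NA")
    print(compare(one_fdr, ("1", "1", ">=3")))  # adds 1 SDR and >=3 TDRs
    print(compare(one_fdr, ("1", "1", "0")))    # adds 1 SDR and 0 TDRs
```

Under this heuristic, one affected FDR plus one SDR and ≥3 TDRs has an interval that does not overlap the interval for one affected FDR alone, while one FDR plus one SDR and no TDRs does overlap it. This is consistent with the statement that second- and third-degree history adds to risk when first-degree history is positive.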
Methods for Model Incorporating Clinical Variables and Family History

1 Select cases
• Seen as an inpatient or outpatient at Intermountain at least once during the years 1996-2000, at age 18 or older
• Member of a UPDB dataset containing individuals part of ≥3 generation family history
• Diagnosed (for the first time) with CRC ≥ year 2000
• CRC site/histology codes specified by Utah Cancer Registry

2 Select controls
• Same criteria as cases (except diagnosis)
• Matching criteria:
  • +/- 1 year on birth year
  • Sex
  • Has to have visit/stay 1 year prior to case Dx date or anytime after
  • 1:10 case/control matching ratio

3 Gather clinical data
• Exposure data ending at a ‘Reference’ date 6 months before the diagnosis date for each case, and at the same reference date for controls matched to that case
• Data gathered: other diagnoses, medications, BMI, screening procedures, tobacco and alcohol use

4 Collect and aggregate family history data
• Collect CRC and other cancer histories for all known FDRs, SDRs, and TDRs of cases and controls
• Tally counts of affected relatives per case/control by cancer

5 Build and test models
• Prepare dataset for analysis
• Withhold part of dataset for testing or utilize Bootstrap or cross-validation
• Compare performance of models using Area Under the Curve (AUC) of an ROC

Figure 1: Example timelines for a case and matched control included in dataset. Timeline annotations: birth (+/- 1 year); 1996 to 2000 (must have ≥1 visit/stay documented in this time period); 1 year prior to Dx (control must have visit/stay after this date); 6 months prior to Dx (reference date); CRC Dx (first CRC Dx date must be > year 2000). Data included in analysis from 1993 until reference date (most data are from 1998 and on).

Selected Preliminary Results

Finding                 Presence in cases (%)  Presence in controls (%)  Year data are first available
Colonoscopy             4.4                    9.5                       1997
Flexible sigmoidoscopy  2.7                    4.3                       1997
FOBT                    7.5                    10.1                      1995
Adenoma                 2.2                    2.4                       1999
Advanced adenoma        1.5                    1.1                       1999
IBD                     5.2                    5.2                       1994
Diabetes                18.8                   14.1                      1994
Tobacco use             2.9                    1.6                       2004
Statin order            8.6                    7.7                       1997
HRT order               4.4                    3.2                       1997

Table 2: Comparisons between 789 cases and 7886 controls for presence of selected findings in the EHR.

Figure 2: Percentages of individuals from a sample of 128,927 relatives of cases and controls, not known to be deceased, between 50 and 100 years of age, not known to have CRC, with evidence of having been screened at least once and (separately) evidence of adenomas in the EHR. Chart shows percentages for all individuals, those with 0 affected FDRs, and those with ≥1 affected FDR, for colonoscopy, flexible sigmoidoscopy, FOBT, adenoma, and advanced adenoma.

Challenges

• Documented medication order does not mean the medication was taken by patient
• Not able to determine lifetime exposure from available dataset, which represents 1993 and later
• Other missing/incomplete data (e.g., patient might be seen outside Intermountain; incomplete documentation such as tobacco/alcohol or aspirin use; can’t distinguish “no” from “unknown” status for many variables)
• Erroneous data (e.g., BMI out of possible range, duplicate records). (Known erroneous data disregarded and only individuals with single identifier included in study.)
• Limited contribution of individual risk/protective factors on CRC risk

Despite the challenges we are not able to mitigate, we hope the high-quality family history data as well as screening and disease diagnosis data may still provide valuable insights on CRC risk, with further refinement of research objectives, study design, and the data analyzed.

Future Work

• Determine whether expanded set of cases and matched controls improves performance
• Use familial relative risk estimates, current ages, and local morbidity and mortality rates, in a sample of patients to estimate absolute risks of acquiring CRC over a period of time
• Estimate gaps in risk-appropriate screening based on review of EHR in “relatives” sample

References

1. Feero WG, Bigley MB, Brinner KM, et al. New Standards and Enhanced Utility for Family Health History Information in the Electronic Health Record: An Update from the American Health Information Community's Family Health History Multi-Stakeholder Workgroup. J Am Med Inform Assoc. 2008 Nov-Dec;15(6):723-8.
2. Freedman AN, Slattery ML, Ballard-Barbash R, et al. Colorectal cancer risk prediction tool for white men and women without known susceptibility. J Clin Oncol. 2009 Feb 10;27(5):686-93.
3. Taylor DP, Burt RW, Williams MS, Haug PJ, Cannon-Albright LA. Population-based family-history-specific risks for colorectal cancer: a constellation approach. Gastroenterology. (In press)
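Supplementary sketch: the control-selection criteria listed under Methods (same sex, birth year within one year, a visit/stay on or after one year before the case's diagnosis date, up to 10 controls per case) and the reference date (6 months before diagnosis) could be expressed roughly as follows. This is a minimal illustration under stated assumptions, not the study's implementation. The `Person` record, its field names, and the functions are hypothetical; 6 months is approximated as 183 days and 1 year as 365 days; each case is matched independently, because the poster does not say whether a control may be reused across cases.

```python
import random
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    pid: str
    sex: str
    birth_year: int
    last_visit: date                # most recent documented visit/stay
    crc_dx: Optional[date] = None   # None if no CRC diagnosis


def reference_date(dx: date) -> date:
    """Exposure data end 6 months before diagnosis (approximated as 183 days)."""
    return dx - timedelta(days=183)


def eligible(case: Person, cand: Person) -> bool:
    """Matching criteria from the Methods list, applied to one candidate."""
    return (
        cand.crc_dx is None
        and cand.sex == case.sex
        and abs(cand.birth_year - case.birth_year) <= 1
        # visit/stay 1 year prior to the case's Dx date or anytime after
        and cand.last_visit >= case.crc_dx - timedelta(days=365)
    )


def match_controls(case: Person, pool: List[Person],
                   ratio: int = 10, seed: int = 0) -> List[Person]:
    """Randomly draw up to `ratio` eligible controls for one case."""
    candidates = [p for p in pool if eligible(case, p)]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(candidates, min(ratio, len(candidates)))
```

Matched controls would then share the case's reference date, so exposure data for the whole matched set end at `reference_date(case.crc_dx)`.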