Infectious Disease Informatics: Overview and The BioPortal Experience
Hsinchun Chen, Ph.D.
Artificial Intelligence Lab, U. of Arizona
NSF BioPortal Research Center
Acknowledgement: NSF, CIA, DHS, CDC, NCI
2016/7/12

Medical Informatics: The computational, algorithmic, database, and information-centric approach to the study of medical and health care problems.
Infectious Disease Informatics: Medical informatics for infectious disease, public health, and biodefense.
(Hsinchun Chen et al., 2005; Hsinchun Chen et al., 2010)

IDI and Syndromic Surveillance Systems
• Data sources and collection strategies: formal-informal (sequence to epi), standards, data entry and transmission, security
• Data analysis and outbreak detection: syndromic classification, outbreak detection methods (temporal, spatial, spatial-temporal), multiple data streams
• Data visualization, information dissemination, and alerting: GIS, temporal, sequence, text, interactive
• System assessment and evaluation: algorithms, data collection, information dissemination, interface, usability

Syndromic Surveillance Data Sources in Different Stages of Developing a Disease
Reaching Situational Awareness
[Figure reproduced from Mandl et al. (2004)]

Syndromic Surveillance Systems
• Generation 1, paper-based: paper, fax, telephone, telephone directory, etc.
• Generation 2, email-based: email, Word/Access, pager, cell phone, etc.
• Generation 3, database-driven: database, standards, messaging, tabulation, GIS, graphs, text, etc.
• Generation 4, search engine-based: real-time, interactive, web services, visualized, GIS, graphs, texts, sequences, contact networks, etc.
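The outline above lists temporal methods among the outbreak detection approaches. As a rough illustration only, the sketch below flags days whose count exceeds a moving baseline by a chosen number of standard deviations. The function name, the 7-day baseline, the 3-SD threshold, the `min_sd` floor, and the sample counts are all assumptions made for this example; the slides do not specify any particular detector.

```python
from statistics import mean, stdev


def flag_outbreak_days(daily_counts, baseline_days=7, threshold_sd=3.0, min_sd=1.0):
    """Return indices of days whose count exceeds mean + threshold_sd * sd
    of the preceding `baseline_days` days (hypothetical parameters)."""
    flagged = []
    for t in range(baseline_days, len(daily_counts)):
        baseline = daily_counts[t - baseline_days:t]
        mu = mean(baseline)
        # Floor the standard deviation so a perfectly flat baseline
        # does not turn every small increase into an alert.
        sd = max(stdev(baseline), min_sd)
        if daily_counts[t] > mu + threshold_sd * sd:
            flagged.append(t)
    return flagged


# Made-up daily chief-complaint counts with one spike on day index 8.
counts = [4, 5, 3, 6, 4, 5, 4, 5, 21, 6]
print(flag_outbreak_days(counts))  # prints [8]
```

A detector of this kind looks only at time; the spatial and spatial-temporal methods named in the outline also need case locations.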

Syndromic Surveillance System Survey
Project | User population | Stakeholders
RODS | Pennsylvania, Utah, Ohio, New Jersey, Michigan, etc.; 418 facilities connected to RODS | RODS laboratory, U of Pittsburgh
STEM | N/A | IBM
ESSENCE II | 300 worldwide DoD medical facilities | DoD
EARS | Various city, county, and state public health officials in the United States and abroad | US CDC
BioSense | Various city, county, and state public health officials in the United States and abroad | US CDC
RSVP (Rapid Syndrome Validation Project) | Kansas, NM | Sandia NL, NM
BioPortal | NY, CA, Kansas, AZ, Taiwan | U of Arizona

Sample Systems and Data Sources Utilized
Project | Data sources | Techniques
RODS | Chief complaints (CC); OTC medication sales | Free-text Bayesian disease classification
STEM | Simulated disease data | Disease modeling and visualization, SIR
ESSENCE II | Military ambulatory visits; CC; absenteeism data |
EARS | 911 calls; CC; absenteeism; OTC drug sales | Human-developed CC classification rules
BioSense | City/state generated geocoded clinical data | Graphing/mapping displays
RSVP | Clinical and demographic data | PDA entry and access
BioPortal | Geocoded clinical data; genomic sequences; multilingual CC | Real-time access and visualization; web-based hotspot analysis; sequence visualization; multilingual ontology-based CC classification

COPLINK System

COPLINK News
• The New York Times, November 2, 2002
• ABC News, April 15, 2003
• Newsweek Magazine, March 3, 2003

Dark Web System

Dark Web News
• Project Seeks to Track Terror Web Posts, 11/11/2007
• Researchers say tool could trace online posts to terrorists, 11/11/2007
• Mathematicians Work to Help Track Terrorist Activity, 9/14/2007
• Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007

BioPortal: Overview, West Nile Virus
(real-time information collection, sharing, access, visualization, and analysis; epi data across species)

BioPortal Project Goals
Demonstrate
and assess the technical feasibility and scalability of an infectious disease information sharing (across species and jurisdictions), alerting, and analysis framework.
Develop and assess advanced data mining and visualization techniques for infectious disease data analysis and predictive modeling.
Identify important technical and policy-related challenges in developing a national infectious disease information infrastructure.

Information Sharing Infrastructure Design
[Architecture diagram. Labeled components: data providers (NYSDOH, CADHS, New); PHINMS network; XML/HL7 network; adaptors (SSL/RSA); info-sharing infrastructure with data ingest control module and cleansing/normalization; portal data store (MS SQL 2000)]

Data Access Infrastructure Design
[Architecture diagram. Labeled components: public health professionals, researchers, policy makers, law enforcement agencies, and other users; browser (IE/Mozilla/…) over an SSL connection; WNV-BOT Portal with data search and query, spatial-temporal visualization, analysis/prediction, and HAN or personal alert management; web server (Tomcat 4.21 / Struts 1.2); data store user access control API (Java); dataset privileges management; access privilege definitions; data store (MS SQL 2000)]

Spatial-Temporal Visualization
• Integrates four visualization techniques: GIS View, Periodic Pattern View, Timeline View, Central Time Slider
• Visualizes the events in multiple dimensions to identify hidden patterns: spatial, temporal, hotspot analysis, phylogenetic tree, contact network analysis

BioPortal Prototype Systems

Outbreak Detection & Hotspot Analysis
• Hotspot is a condition indicating some form of clustering in a spatial and temporal distribution (Rogerson & Sun 2001; Theophilides et al. 2003; Patil & Tailie 2004; Zeng et al. 2004; Chang et al. 2005)
• For WNV, localized clusters of dead birds typically identify high-risk disease areas (Gotham et al.
2001); automatic detection of dead bird clusters can help predict disease outbreaks and allocate prevention/control resources effectively

Retrospective Hotspot Analysis Problem Statement

Risk-Adjusted Support Vector Clustering (RSVC)
• Feature space
• Minimum sphere
• High baseline density makes two points far apart in feature space
• Estimate baseline density

Study II: NY WNV (birds, mosquitoes, and humans)
• Timeline: 140 records from March 5 to May 26 (baseline); 224 records from May 26 to July 2 (new cases)
• On May 26, 2002, the first dead bird with WNV was found in NY
• Based on NY’s test dataset

Dead Bird Hotspots Identified

[Annotated screenshots: data search. Recoverable callouts: user login; user main page; available datasets; dataset name; choose WNV disease data; select CA dead bird, chicken, and NY dead bird data; advanced spatial/temporal search criteria (positive cases, time range, county/state, specify bird species); results listed in table; select background maps (NY/CA population, rivers and lakes); start STV]

[Annotated screenshots: STV on NY dead bird data. Recoverable callouts: NY dead bird spatial distribution pattern; temporal distribution pattern; GIS, Timeline, and Periodic Pattern views; control panel; close; zoom in on NY; year 2001 data; move time slider; 2 weeks window; view all 3 year data; 3 year span; similar time pattern; overall pattern; concentrated in May/Jun]

[Annotated screenshots: spatial distribution pattern. Recoverable callouts: overlay population map; enable population map; dead bird cases migrate from Long Island into upstate NY; dead bird cases distribute along populated areas near the Hudson River; move time slider; season end]

BioPortal HotSpot Analysis: RSVC, SaTScan, and CrimeStat Integrated
(first visual, real-time hotspot analysis system for disease surveillance)
West Nile virus in California

Hotspot Analysis-Enabled STV
[Annotated screenshot. Recoverable callouts: regular STV; select algorithms; select hotspot to highlight case points; hotspots found!]
[Further callouts: select baseline and case periods; select target geographic area]

BioPortal – FMD
(many species; phylogenetic tree and news)

FMD Global Surveillance: Lessons Learned
• Must understand risks, and the nature of changing risks, in order to develop strategies for prevention and mitigation on a global scale
• Must understand the global situation in order to prepare locally
• United Kingdom FMD outbreak, 2001: $12B; 50-60% of 4M farm animals (cows, pigs, sheep) slaughtered

International FMD BioPortal
• Real-time, web-based situational awareness of FMD outbreaks worldwide through the establishment of an international information technology system.
• FMDv characterization at the genomic level, integrated with associated epidemiological information and modeling tools, to forecast national, regional, and/or international spread and the prospect of importation into the US and the rest of North America.
• Web-based crisis management of resources: facilities, personnel, diagnostics, and therapeutics.
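The hotspot-enabled STV workflow shown earlier (choose a baseline period, a case period, and a target area, then highlight unusual concentrations) can be illustrated with a deliberately simple grid comparison. This sketch is not RSVC, SaTScan, or CrimeStat; it only counts records per grid cell and compares the case period with the baseline. The function names, cell size, thresholds, and coordinates are all hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def grid_cell(lon, lat, cell_deg):
    """Map a coordinate to an integer grid cell of side `cell_deg` degrees."""
    return (math.floor(lon / cell_deg), math.floor(lat / cell_deg))


def hotspot_cells(baseline_pts, case_pts, cell_deg=0.5, min_cases=3, ratio=2.0):
    """Return grid cells where case-period counts are high relative to baseline."""
    base = Counter(grid_cell(x, y, cell_deg) for x, y in baseline_pts)
    case = Counter(grid_cell(x, y, cell_deg) for x, y in case_pts)
    hot = []
    for cell, n_case in case.items():
        n_base = base.get(cell, 0)
        # Add one to the baseline so cells with no baseline records
        # do not cause a division by zero.
        if n_case >= min_cases and n_case / (n_base + 1) >= ratio:
            hot.append(cell)
    return sorted(hot)


# Made-up (longitude, latitude) points, not real surveillance records.
baseline = [(-73.9, 40.7), (-73.8, 40.8), (-75.1, 43.0)]
cases = [(-73.9, 41.6), (-73.95, 41.7), (-73.8, 41.55), (-73.7, 41.9),
         (-73.9, 40.7), (-73.8, 40.8)]
print(hotspot_cells(baseline, cases))  # prints [(-148, 83)]
```

A fixed grid is sensitive to where cell boundaries fall, which is one reason scan-statistic and clustering methods such as those integrated in BioPortal are used in practice.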

Preliminary Global FMD Dataset
• Provider: UC Davis FMD Lab
• Information sources: reference labs and OIE
• Coverage: 28 countries globally
• Dataset size: 30,000+ records, of which 6,789 records are complete
• Host species: Cattle, Caprine, Ovine, Bovine, Swine, NK, Elephant, Buffalo, Sheep, Camelidae, Goat
• Region-wise distribution of FMD data (pie chart): South America 66%, Central and South Asia 15%, Europe 14%, Middle East Asia 4%, Africa 1%
• Distribution by host species (pie chart): Bovine 37%, Ovine 37%, Swine 11%, Cattle 5%, Caprine 4%, Buffaloes 3%, Goats 3%, Elephant 0%, Sheep 0%, Camelidae 0%

Global FMD Coverage in BioPortal

FMD Migration Visualization using BioPortal (cases in South Asia)
FMD cases travel back and forth between countries

BioPortal-Afghanistan

International FMD News
• Provider: UC Davis FMD Lab
• Information sources: Google, Yahoo, and open Internet sources
• Time span: Oct 4, 2004 – present (real-time messaging under development)
• Data size: 460 events (6/21/05)
• Coverage: 51 countries (Africa: 11, Asia: 16, Europe: 12, Americas: 12)
• Distribution of events by region (pie chart): America 27%, Europe 27%, Asia 15%, Australia 14%, Africa 11%, UNDEFINED 5%, Aisa [sic] 1%

Searching FMD News
http://fmd.ucdavis.edu/
Searchable by: date range, country, keyword

Visualizing FMD News on BioPortal

FMD Genetic Visualization
• Goal: Extend STV to incorporate a 3rd dimension, phylogenetic distance
• Include a phylogenetic tree.
• Identify phylogenetic groups and color-code the isolate points on the map.
• Leverage available NCBI tools such as BLAST.
• Proof of concept: SAT 2 & 3 analysis
• Data: 54 partial DNA sequence records in South Africa received from UC Davis FMD Lab (Bastos, A.D. et al.
2000, 2003)
• Date range: 1978-1998
• Countries covered: South Africa, Zimbabwe, Zambia, Namibia, Botswana

Sample FMD Sequence Records
• Color-coded view (MEGA3)
• Textual view of gene sequence

Phylogenetic Tree of Sample FMD Data
Identify 6 groups within 2 major families (MEGA3; based on sequence similarity)
[Tree labels: Group1 through Group6]

Genetic, Spatial, and Temporal Visualization of FMD Data
• Phylogenetic tree, color coded
• Isolates’ locations, color coded
• Isolates’ appearances in time

FMD Time Sequence Analysis
• First family cases appeared throughout the period
• Second family cases existed before 1993 and reappeared later, after 1997

FMD Periodic Pattern Analysis
2nd family concentrated in Feb., while 1st family spread evenly

Locations of Family 1 records
Selected only groups 1, 2, and 3 and found a spatial cluster

Locations of Family 2 records
Selected only groups 4, 5, and 6; sparse isolate locations

BioPortal: Influenza, SARS
(chief complaint syndromic surveillance, contact network analysis and visualization)

Existing CC Classification Methods
Classification method | Systems | Authors
Keyword Match + Synonym List + Mapping Rules | DOHMH (NY City), EARS | Mikosz et al. (2004)
Weighted Keyword Match (Vector Cosine Method) + Mapping Rules | ESSENCE | Sniegoski (2004)
Naïve Bayesian | RODS | Olszewski (2003), Ivanov et al. (2002)
Bayesian Network | N/A | Chapman et al. (2004)

Syndromic Categories in Different Systems
System (number of syndromes): details
CDC (2002) (11): Botulism, Hemorrhagic, Lymphadenitis, Cutaneous Lesion, Gastrointestinal, Respiratory, Neurological, Rash, Specific Infection, Fever, Severe Illness or Death
EARS (41): Lower Resp., Upper Resp., Neuro, Febrile, Poison, Hemorrhage, Botulinic, Rash, Fever, etc.
(41 categories)
RODS (8): Gastrointestinal, Constitutional, Respiratory, Rash, Hemorrhagic, Botulinic, Neurological, Other
ESSENCE (8): Death, Gastr, Neuro, Rash, Respi, Sepsi, Unspe, Other

Overall System Design
[Diagram: chief complaints pass through three stages to produce syndromes]
• Stage 1, CC Standardization: chief complaints to symptoms (EMT-P, synonym list)
• Stage 2, Symptom Grouping: symptoms to symptom groups (weighted semantic similarity score, UMLS ontology, UMLS concepts, symptom grouping table, EMT-P, EARS symptom table)
• Stage 3, Syndrome Classification: symptom groups to syndromes (JESS, EARS syndrome rules)

Comparing BioPortal to RODS

Trained BioPortal
Syndrome | TP+FN | PPV | Sensitivity | Specificity | F | F2
GI | 124 | 91.41% | 94.35%*** | 98.74% | 0.93*** | 0.93***
HEMO | 30 | 82.86% | 96.67%*** | 99.38% | 0.89** | 0.92***
RASH | 15 | 66.67% | 66.67%** | 99.49% | 0.67* | 0.67**
RESP | 110 | 92.08% | 84.55%**** | 99.10% | 0.88*** | 0.87***
UPPER_RESP | 43 | 80.43% | 86.05% | 99.06% | 0.83 | 0.84

RODS
Syndrome | TP+FN | PPV | Sensitivity | Specificity | F | F2
GI | 124 | 89.89% | 64.52% | 98.97% | 0.75 | 0.71
HEMO | 30 | 90.91% | 66.67% | 99.79%* | 0.77 | 0.73
RASH | 15 | 58.33% | 46.67% | 99.49% | 0.52 | 0.50
RESP | 110 | 87.84% | 59.09% | 98.99% | 0.71 | 0.66
UPPER_RESP | 43 | N/A | N/A | N/A | N/A | N/A

* p-value < 0.1; ** p-value < 0.05; *** p-value < 0.01
Statistical test is based on 2,500 bootstrappings.

Comparing BioPortal to EARS

Trained BioPortal
Syndrome | TP+FN | PPV | Sensitivity | Specificity | F | F2
GI | 124 | 91.41% | 94.35%*** | 98.74% | 0.93*** | 0.93***
HEMO | 30 | 82.86% | 96.67%*** | 99.38% | 0.89*** | 0.92***
RASH | 15 | 66.67% | 66.67%** | 99.49% | 0.67 | 0.67*
RESP | 110 | 92.08% | 84.55%**** | 99.10% | 0.88*** | 0.87***
UPPER_RESP | 43 | 80.4%*** | 86.05%*** | 99.06%*** | 0.83*** | 0.84***

EARS
Syndrome | TP+FN | PPV | Sensitivity | Specificity | F | F2
GI | 124 | 93.75%* | 72.58% | 99.32%*** | 0.82 | 0.78
HEMO | 30 | 100.00%*** | 33.33% | 100.00%*** | 0.50 | 0.43
RASH | 15 | 70.00% | 46.67% | 99.70% | 0.56 | 0.53
RESP | 110 | 90.36% | 68.18% | 99.10% | 0.78 | 0.74
UPPER_RESP | 43 | 58.70% | 62.79% | 98.01% | 0.61 | 0.61

Statistical test is based on 2,500 bootstrappings.
* p-value < 0.1; ** p-value < 0.05
*** p-value < 0.01

Chinese CC Preprocessing: System Design
[Diagram: raw Chinese chief complaints are preprocessed in three stages]
• Stage 0.1, Separate Chinese and English Expressions: raw Chinese CCs to English expressions and Chinese expressions
• Stage 0.2, Chinese Phrase Segmentation: Chinese expressions to segmented Chinese phrases
• Stage 0.3, Chinese Phrase Translation: segmented Chinese phrases to translated Chinese phrases
• Other labels in the diagram: Chinese medical phrases, common Chinese phrases, Chinese to English dictionary, mutual info.

Group by Hospital

Group by Syndrome Classification

Taiwan SARS Contact Network Visualization
• Social network visualization with patients and geographical locations
• Scroll bar on the time dimension to see the evolution of a network

Taiwan SARS Network Evolution – Hospital Outbreak
The index patient of Heping Hospital began to have symptoms.

BioPortal: Towards building integrated, real-time situation awareness for syndromic surveillance and biodefense

Syndromic Surveillance vs. BioSurveillance (Species Jumps)
• Data sources and collection strategies: levels of data (sequence to epi to social media), data granularity, automated/manual, integration, information sharing (incentives and standards)
• Data analysis and outbreak detection: biological/genetic modeling/analytics (hypothesis testing), exploration, hypothesis generation, sequence to time/space/event
• Data visualization, information dissemination, and alerting: target users (DTRA), dissemination strategies, surveillance, prediction
• System assessment and evaluation: target users (DTRA), acceptance, usability, situation awareness, decision making

BioPortal Information
Hsinchun Chen, hchen@eller.arizona.edu
AI Lab project information: http://ai.arizona.edu
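
The F and F2 columns in the BioPortal vs. RODS and BioPortal vs. EARS comparison tables combine PPV and sensitivity into a single score. The slides do not define either column, so the sketch below is an assumption: it uses the general weighted F-measure, with the weight passed directly as beta squared. Recomputing from the reported PPV and sensitivity, the F column matches a weight of 1, and the F2 column appears to match a weight of 2 on beta squared (rather than beta = 2); that reading is inferred from the numbers, not stated in the source. The function name and parameter are hypothetical.

```python
def f_measure(ppv, sensitivity, beta_sq=1.0):
    """Weighted harmonic mean of PPV (precision) and sensitivity (recall).

    beta_sq is the squared weight on sensitivity; beta_sq=1.0 gives the
    balanced F-measure. Inputs are fractions in [0, 1].
    """
    denom = beta_sq * ppv + sensitivity
    if denom == 0:
        return 0.0  # both inputs are zero; define the score as zero
    return (1.0 + beta_sq) * ppv * sensitivity / denom


# Values taken from the comparison tables above.
print(f"{f_measure(0.9141, 0.9435):.2f}")               # BioPortal GI, F: 0.93
print(f"{f_measure(0.8989, 0.6452, beta_sq=2.0):.2f}")  # RODS GI, F2: 0.71
print(f"{f_measure(1.0000, 0.3333, beta_sq=2.0):.2f}")  # EARS HEMO, F2: 0.43
```

Weighting sensitivity more heavily than PPV suits surveillance settings where a missed case is costlier than a false alarm, which may be why a second, sensitivity-weighted column is reported.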