Knowledge Graph: Connecting Big Data Semantics
Ying Ding
Indiana University

Entity in Big Data
• Entity: things, not strings
• Relationship matters: connecting entities
• Change in searching:
  – string → entity → relation → subgraph
  – entity relations

Entities
• Entities in social web: person, location, organization, book, music (freebase.com: Metformin)
• Entities in translational medicine: gene, drug, disease, protein, side effect (conceptwiki: Disease Lafora)
• Data: scientific papers (PubMed, PubMed Central) and experimental data (Swiss-Prot, KEGG, DrugBank)

Challenges
• Knowledge Graph
  – Entity graph
  – Schema graph (small size) vs. instance graph (large size)
  – Graph mining (e.g. shortest path, depth-first/breadth-first search, PageRank)
    • Neo4j, NoSQL graph database
  – Graph pattern search (SPARQL)
    • Triple store, Virtuoso (openlinksw)

Use Case: Individualized Cohort in EHR
• An EHR-based individualized cohort can provide a better solution than standard guidelines because the cohort is drawn from a patient population of the same geolocation, demographics, and socio-economic group as the given patient.
• EHRs are organized around the patient, not by concepts (diseases, lab results, medications, etc.)
• EHR data contains controlled vocabularies (e.g., demographics, diagnostic codes, medications, procedures, etc.) and continuous values (e.g., lab tests, medication doses, etc.).
  – Category hierarchy (parent, siblings, subtrees): search for patients like a given diagnosis, “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22 (with chronic kidney disease) → ICD10:E11 (diabetes in general)
  – Continuous values: serum glucose = 120 mg/dL (many continuous values may not have a natural aggregate binning)
  – Queries for patients are rarely exact (fasting serum glucose = 126 → serum glucose between 120 and 130, or serum glucose in the 80th percentile at this time)
• A patient can have 100-100,000 property values, which contain 100 controlled vocabulary values and 1000 continuous values. Most values are time based.

Challenges
• Searching challenges
  – Category hierarchy (parent, siblings, subtrees)
  – Continuous values without a natural aggregate binning
  – Queries for patients that are rarely exact
  – Map the changes in value to changes in time: search for a patient with a 60th-to-90th percentile transition between two serum glucose values over a 6-month time frame. If we have N glucose values, then for any two patients we have to make N*(N-1)/2 time-based glucose-value comparisons. How do we scale this up?
  – Find common patterns across a set of individualized cohort patients. This means comparing combinations of subsets of millions of differentials for each patient in the cohort.

Relational Database → Semantic Graph
• Paradigm shift from relational row-column lookup to semantic graph traversal
  – Relational databases are less efficient at joins
  – Big indexing overhead (need to index every column)

EHR RDF Graph
Patient EHR data in semantic graph representation. EHR timelines for Patients A and B are shown as RDF graphs.
Property values of each patient (demographics, labs, diagnosis, etc.) are connected to their respective ontologies, enabling searches for patterns across different patients.

EHR RDF Graph
Applying continuous value classes will enrich the set of patients retrieved from the database.
2A. Property values as literal nodes will not link “like” patients together without a “relational” query.
2B. By using controlled vocabulary (CV)-ontology edges, we will be able to link patients through CV-value nodes.
2C. By adding “nearby” classes to continuous value nodes, we will link additional patients. Different strategies will create different “nearby” links.

Challenges: Semantic Graph Mining
• Graph indexing
  – gIndex: indexing frequent subgraphs, using subgraphs as features
• Graph classification, clustering
  – Path-based clustering and top-k similarity problems in heterogeneous information networks
• Path-based graph mining
  – Complex dependencies within heterogeneous networks
  – Conventional supervised classification methods assume that the objects are independent
  – Sequential matching vs. snapshot matching, as EHR records have a time dimension

Linked Open Data

Challenges for Semantic Web
• How to handle ontology graph + instance graph
• How to handle inferred triples and existing triples (reasoning)
• Graph pattern search vs. graph mining
• Datatype properties vs. object properties
• Different levels of semantics: ontology (schema), categorized values (terminology), continuous values (binning?), literal
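
Example (illustrative sketch)
The linking idea in 2A-2C, and the "continuous values (binning?)" question above, can be made concrete with a small plain-Python sketch. Everything here is an assumption for illustration rather than the method used in the talk: the patient names and values are invented toy data, `icd_parent` approximates the ICD-10 hierarchy by dropping the part after the dot, and `glucose_bin` uses fixed-width bins as just one possible "nearby" strategy.

```python
from collections import defaultdict

# Hypothetical toy observations: (patient, property, value).
observations = [
    ("patientA", "diagnosis", "ICD10:E11.21"),
    ("patientB", "diagnosis", "ICD10:E11.22"),
    ("patientA", "serum_glucose", 126.0),
    ("patientB", "serum_glucose", 128.0),
    ("patientC", "serum_glucose", 95.0),
]


def icd_parent(code):
    """Parent category of a code, approximated by dropping the part after the dot."""
    return code.split(".")[0]


def glucose_bin(value, width=10):
    """Assumed 'nearby' strategy: map a value to a fixed-width bin node."""
    low = int(value // width) * width
    return f"glucose:{low}-{low + width}"


def shared_links(observations):
    """Return the graph nodes that connect two or more patients.

    Raw literals (2A) are never used as nodes, so they link nobody.
    Diagnosis codes and their parents (2B) and glucose bins (2C) are nodes.
    """
    node_to_patients = defaultdict(set)
    for patient, prop, value in observations:
        if prop == "diagnosis":
            node_to_patients[value].add(patient)
            node_to_patients[icd_parent(value)].add(patient)
        elif prop == "serum_glucose":
            node_to_patients[glucose_bin(value)].add(patient)
    return {
        node: sorted(patients)
        for node, patients in node_to_patients.items()
        if len(patients) > 1
    }


links = shared_links(observations)
```

In this toy data, patients A and B share neither an exact diagnosis code nor an exact glucose value, yet they become connected through the parent node `ICD10:E11` and through the bin node `glucose:120-130`; patient C stays unlinked. A different bin width or an overlapping-window scheme would produce a different set of "nearby" links, which is the trade-off the slides point to.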