Knowledge Graph Search - salsahpc

Knowledge Graph:
Connecting Big Data Semantics
Ying Ding
Indiana University
Entity in Big Data
• Entity: things, not strings
• Relationship matters: connecting entities
• Change in searching:
– string → entity → relation → subgraph
Entities
• Entities in social web: person, location, organization, book,
music (freebase.com: Metformin)
• Entities in translational medicine: gene, drug, disease,
protein, side effect (conceptwiki: Disease Lafora)
• Data: scientific papers (PubMed, PubMed Central) and
experimental data (Swiss-Prot, KEGG, DrugBank)
Challenges
• Knowledge Graph – Entity Graph
– Schema graph (small size) vs. Instance graph (large
size)
– Graph mining (e.g. shortest path, depth-first/breadth-first search, PageRank)
• Neo4j, a NoSQL graph database
– Graph pattern search (SPARQL)
• Triple store, e.g. Virtuoso (OpenLink Software)
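As an illustration of the graph-mining side, here is a minimal sketch of a breadth-first shortest-path search over an entity graph. The graph contents and the `shortest_path` helper are hypothetical and only stand in for what a graph database such as Neo4j would do at scale.

```python
from collections import deque

# Hypothetical toy entity graph: each key is an entity, each value lists its neighbours.
ENTITY_GRAPH = {
    "Metformin": ["Type 2 diabetes", "Lactic acidosis"],
    "Type 2 diabetes": ["Metformin", "Insulin"],
    "Lactic acidosis": ["Metformin"],
    "Insulin": ["Type 2 diabetes", "Hypoglycemia"],
    "Hypoglycemia": ["Insulin"],
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal via breadth-first search, or None."""
    if start == goal:
        return [start]
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in parents:
                continue
            parents[neighbour] = node
            if neighbour == goal:
                # Walk back through the parent links to rebuild the path.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


path = shortest_path(ENTITY_GRAPH, "Metformin", "Hypoglycemia")
print(" -> ".join(path))
```

Breadth-first search visits entities in order of hop distance, so the first time the goal is reached the path is guaranteed to be a shortest one in an unweighted graph.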
Use Case: Individualized Cohort in EHR
• An EHR-based individualized cohort can provide a better solution
than standard guidelines because the cohort is drawn from a
patient population with the same geolocation, demographics,
and socio-economic group as the given patient.
• EHRs are organized around the patient, not by concepts
(diseases, lab results, medications, etc.)
Use Case: Individualized Cohort in EHR
• EHR data contains controlled vocabularies (e.g., demographics,
diagnostic codes, medications, procedures) and continuous
values (e.g., lab tests, medication doses).
– Category hierarchy (parent, siblings, subtrees): search for patients like a given
diagnosis “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22
(with chronic kidney disease) → ICD10:E11 (diabetes in general)
– Continuous values: serum glucose = 120 mg/dL (many continuous values
may not have a natural aggregate binning)
– Queries for searching patients are rarely exact (fasting serum glucose = 126 →
serum glucose between 120 and 130, or serum glucose in the 80th
percentile at this time)
• A patient can have 100 to 100,000 property values, which include 100
controlled vocabulary values and 1,000 continuous values. Most
values are time-based.
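The two kinds of inexact search described above (widening a diagnosis code up its category hierarchy, and relaxing an exact lab value to a range) can be sketched as follows. This is a simplified illustration: the patient records, field names, and helper functions are hypothetical, and the parent rule (drop the last character of the code) is an assumption rather than a full ICD-10 implementation.

```python
# Hypothetical patient records: one diagnosis code and one serum glucose value each.
PATIENTS = [
    {"id": "p1", "dx": "E11.21", "glucose_mg_dl": 126},
    {"id": "p2", "dx": "E11.22", "glucose_mg_dl": 124},
    {"id": "p3", "dx": "E11.9", "glucose_mg_dl": 128},
    {"id": "p4", "dx": "I10", "glucose_mg_dl": 95},
    {"id": "p5", "dx": "E11.22", "glucose_mg_dl": 180},
]


def icd10_parent(code):
    """Return the parent code (assumed rule: drop the last character), or None at the top."""
    if "." not in code:
        return None  # three-character category, e.g. "E11"
    stem, sub = code.split(".")
    return stem if len(sub) == 1 else f"{stem}.{sub[:-1]}"


def ancestor(code, levels_up):
    """Climb at most levels_up steps in the hierarchy."""
    for _ in range(levels_up):
        parent = icd10_parent(code)
        if parent is None:
            break
        code = parent
    return code


def is_within(code, root):
    """True if code equals root or lies in the subtree below root."""
    while code is not None:
        if code == root:
            return True
        code = icd10_parent(code)
    return False


def find_similar(patients, dx, levels_up, glucose_range):
    """Patients whose diagnosis is in the widened subtree and whose glucose is in range."""
    root = ancestor(dx, levels_up)
    low, high = glucose_range
    return [
        p["id"]
        for p in patients
        if is_within(p["dx"], root) and low <= p["glucose_mg_dl"] <= high
    ]


siblings = find_similar(PATIENTS, "E11.21", 1, (120, 130))      # subtree of E11.2
all_diabetes = find_similar(PATIENTS, "E11.21", 2, (120, 130))  # subtree of E11
print(siblings, all_diabetes)
```

Widening by one level reaches sibling codes such as E11.22; widening by two levels reaches every code under E11.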
Challenges
• Searching challenges
– Category hierarchy (parent, siblings, subtrees): search for patients like a given
diagnosis “ICD10:E11.21” (diabetes with nephropathy) → ICD10:E11.22
(with chronic kidney disease) → ICD10:E11 (diabetes in general)
– Continuous values: serum glucose = 120 mg/dL (many continuous values
may not have a natural aggregate binning)
– Queries for searching patients are rarely exact (fasting serum glucose = 126 →
serum glucose between 120 and 130, or serum glucose in the 80th
percentile at this time)
– Map the changes in value to changes in time: search for a patient with a
60th-to-90th-percentile transition between two serum glucose values over a 6-month time
frame. With N glucose values, comparing any two patients requires
N*(N-1)/2 time-based glucose-value comparisons. How do we scale this
up?
– Find common patterns in a set of individualized cohort patients. This
means comparing combinations of subsets of millions of
differentials for each patient in the cohort.
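To make the N*(N-1)/2 cost concrete, here is a brute-force sketch that enumerates every pair of time-stamped glucose readings for one patient and keeps the pairs that move from at or below the 60th percentile to at or above the 90th within about six months. The readings, the reference population, the 183-day window, and the reading of "transition" as a low-to-high threshold crossing are all assumptions made for illustration.

```python
from bisect import bisect_right
from datetime import date
from itertools import combinations

# Hypothetical reference population of glucose values (mg/dL), kept sorted.
POPULATION = list(range(70, 170))

# Hypothetical time-stamped readings for one patient.
READINGS = [
    (date(2015, 1, 10), 105),
    (date(2015, 3, 15), 125),
    (date(2015, 6, 20), 162),
    (date(2016, 2, 1), 165),
]


def percentile_rank(value, population_sorted):
    """Percentage of the reference population with a value <= value."""
    return 100.0 * bisect_right(population_sorted, value) / len(population_sorted)


def find_transitions(readings, population_sorted, start_max=60, end_min=90, max_days=183):
    """Return (matching date pairs, number of pairs compared)."""
    hits = []
    compared = 0
    # combinations() on the time-ordered readings yields each pair once,
    # earlier reading first: N*(N-1)/2 pairs in total.
    for (d1, v1), (d2, v2) in combinations(sorted(readings), 2):
        compared += 1
        if (d2 - d1).days > max_days:
            continue
        if (percentile_rank(v1, population_sorted) <= start_max
                and percentile_rank(v2, population_sorted) >= end_min):
            hits.append((d1, d2))
    return hits, compared


hits, compared = find_transitions(READINGS, POPULATION)
print(compared, hits)
```

The pair count grows quadratically with the number of readings, which is why this naive enumeration does not scale and motivates the indexing and graph-based approaches discussed later.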
Relational Database Semantic Graph
• Paradigm shift from relational row-column
lookup to semantic graph traversal
– Relational databases are less efficient at joins
– Big indexing overhead (need to index every
column)
EHR RDF Graph
Patient EHR data in semantic graph representation. EHR timelines for Patients A
and B are shown as RDF graphs. Property values of each patient (demographics,
labs, diagnosis, etc.) are connected to their respective ontologies, enabling
searches for patterns across different patients.
EHR RDF Graph
Application of continuous value classes will enrich the patients retrieved from the database.
2A. Property values as literal nodes will not link “like” patients together without a “relational” query.
2B. By using controlled vocabulary (CV)-ontology edges, we will be able to link patients through CV-value
nodes.
2C. By adding “nearby” classes to continuous value nodes, we will link additional patients. Different
strategies will create different “nearby” links.
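The three cases 2A to 2C can be sketched as a small linking experiment. Everything here is an assumption for illustration: the patient values, the fixed-width binning used as the "nearby" strategy, and the helper names. Other binning strategies would produce different links, as the caption notes.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical patient property values.
PATIENTS = {
    "A": {"diagnosis": "ICD10:E11.21", "glucose": 124},
    "B": {"diagnosis": "ICD10:E11.21", "glucose": 127},
    "C": {"diagnosis": "ICD10:I10", "glucose": 131},
}


def glucose_class(value, width=10):
    """Name the fixed-width bin containing value, e.g. 124 -> 'glucose:120-129'."""
    low = (value // width) * width
    return f"glucose:{low}-{low + width - 1}"


def nearby_classes(value, width=10, tolerance=5):
    """Bins touched by [value - tolerance, value + tolerance] (assumes tolerance <= width)."""
    return {glucose_class(v, width) for v in (value - tolerance, value, value + tolerance)}


def linked_pairs(edges):
    """Return patient pairs that share at least one node in a patient -> nodes mapping."""
    members = defaultdict(set)
    for patient, nodes in edges.items():
        for node in nodes:
            members[node].add(patient)
    pairs = set()
    for patients in members.values():
        pairs.update(combinations(sorted(patients), 2))
    return pairs


# 2A: a literal is attached to one patient only, so it is never a shared node.
literal_edges = {p: {f"{p}/glucose={v['glucose']}"} for p, v in PATIENTS.items()}
# 2B: controlled-vocabulary values are shared ontology nodes.
cv_edges = {p: {v["diagnosis"]} for p, v in PATIENTS.items()}
# 2C: "nearby" classes turn continuous values into shared nodes.
nearby_edges = {p: nearby_classes(v["glucose"]) for p, v in PATIENTS.items()}

print(linked_pairs(literal_edges))
print(linked_pairs(cv_edges))
print(linked_pairs(nearby_edges))
```

With these sample values, literals link nobody, the shared diagnosis node links A and B, and the nearby glucose classes additionally pull in C.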
Challenges: Semantic Graph Mining
• Graph indexing
– gIndex: indexing frequent subgraphs, using subgraphs as
features
• Graph classification, clustering
– Path-based clustering and top-k similarity problems in
heterogeneous information networks
• Path-based graph mining
– Complex dependencies within heterogeneous networks
– Conventional supervised classification methods assume
that the objects are independent
– Sequential matching vs. snapshot matching as EHR records
have a time dimension.
Linked Open Data
Challenges for Semantic Web
• How to handle ontology graph + instance
graph
• How to handle inferred triples and existing
triples (reasoning)
• Graph pattern search vs. Graph mining
• Datatype properties vs. object properties
• Different levels of semantics: ontology
(schema), categorized values (terminology),
continuous values (binning?), literal
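The distinction between existing and inferred triples can be illustrated with a small forward-chaining sketch: asserted `rdf:type` triples are propagated up `rdfs:subClassOf` edges until nothing new is derived. Modelling ICD-10 codes as classes, the `dx1` instance, and the `infer_types` helper are assumptions for illustration, not a description of how any particular triple store reasons.

```python
# Hypothetical asserted (existing) triples: one instance triple plus a class hierarchy.
ASSERTED = {
    ("dx1", "rdf:type", "ICD10:E11.21"),
    ("ICD10:E11.21", "rdfs:subClassOf", "ICD10:E11.2"),
    ("ICD10:E11.2", "rdfs:subClassOf", "ICD10:E11"),
}


def infer_types(triples):
    """Return the rdf:type triples entailed by rdfs:subClassOf but not asserted."""
    asserted = set(triples)
    closure = set(asserted)
    changed = True
    while changed:  # repeat until a full pass adds nothing (fixpoint)
        changed = False
        for s, p, o in list(closure):
            if p != "rdf:type":
                continue
            for cls, q, parent in list(closure):
                if q != "rdfs:subClassOf" or cls != o:
                    continue
                derived = (s, "rdf:type", parent)
                if derived not in closure:
                    closure.add(derived)
                    changed = True
    return closure - asserted


inferred = infer_types(ASSERTED)
print(sorted(inferred))
```

Whether to materialize such inferred triples ahead of time or derive them at query time is exactly the storage-versus-query-cost trade-off the bullet on reasoning raises.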