The GAAIN Entity Mapper - Big Data for Discovery Science

GEM: The GAAIN Entity Mapper
Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga
USC Stevens Neuroimaging and Informatics Institute
Keck School of Medicine of USC
July 9th, 2015
At the 11th Data Integration in Life Sciences Conference (DILS) 2015
Marina del Rey
Introduction: GAAIN
• GAAIN: Global Alzheimer’s Association Interactive Network
• Current:
  • Data integrated from 30+ sources
  • Over 250,000 research subjects
• Access: http://www.gaain.org
Data Integration in GAAIN
• Data
  • Subject research data
  • Well structured
  • (Mostly) relational
• Data harmonization
  • Common data model
  • Map datasets to the common model
  • Data ownership sensitivity
Data Mapping
The Data Mapping Problem
• Resource intensive:
  "On average, converting a database to the OMOP CDM, including mapping terminologies, required the equivalent of four full-time employees for 6 months and significant computational resources for each distributed research partner. Each partner utilized a number of people with a wide range of expertise and skills to complete the project, including project managers, medical informaticists, epidemiologists, database administrators, database developers, system analysts/programmers, research assistants, statisticians, and hardware technicians. Knowledge of clinical medicine was critical to correctly map data to the proper OMOP CDM tables."
• Complexity of data harmonization
  • Several thousand data elements per dataset
  • Multiple datasets
• Data elements
  • Complex scientific concepts
  • Cryptic names
  • Domain expertise required to interpret
Observations
• Rich element information is available in documentation
  • Data dictionaries!
• Element information:
  • Descriptions
  • Metadata
• Need better approaches to matching cryptic element names, e.g.:
  • MOMDEMYR1
  • PTGNDR
Data Dictionaries
• Rich element details
Approach
• Extract element description and metadata details from data dictionaries
• Determine element matches based on the above
• Block improbable match candidates based on metadata
• Determine element similarity (and thus match likelihood) based on name and description similarity
• Initial version of the system was knowledge-driven; machine-learning classification was added later
GEM: A Software Assistant for Data Mapping
GEM Architecture
Element Extraction
• Extract and segregate element information
Metadata Detail Extraction
• Element categories
  • Four categories:
    (i) Special
    (ii) Coded (binary, other coded)
    (iii) Numerical
    (iv) Text
  • Classifier: heuristic based (see sketch below)
• Other metadata details
  • Cardinality
  • Range (min, max)
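The slides do not show the heuristic classifier itself; below is a minimal sketch of what such an element-category classifier could look like, assuming each element carries its extracted legend (code/value pairs) and some sample values. The special-name list, function names, and rules are illustrative assumptions, not taken from GEM.

```python
# Hypothetical heuristic classifier for GEM's four element categories.
SPECIAL_NAMES = {"subject_id", "visit_date", "examdate"}   # assumed examples

def is_number(value):
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def categorize(name, legend, sample_values):
    """legend: dict of code -> meaning (empty if not coded); sample_values: observed values."""
    if name.lower() in SPECIAL_NAMES:
        return "special"
    if legend:                              # element defines coded values in the dictionary
        return "binary" if len(legend) == 2 else "coded"
    if sample_values and all(is_number(v) for v in sample_values):
        return "numerical"
    return "text"

# categorize("PTGENDER", {"1": "Male", "2": "Female"}, []) -> "binary"
```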
MDB: The Metadata Database
• Extracted detailed metadata per element (see sketch below):
  • Source
  • Name
  • Description
  • Legend
  • Cardinality
  • Range
  • Category
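A minimal sketch of how one MDB record might be represented in code; the field names follow the slide, but the class itself is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class ElementRecord:
    """One MDB entry: the metadata GEM extracts per data element (fields per the slide)."""
    source: str                                   # originating dataset, e.g. "ADNI"
    name: str                                     # element name from the data dictionary
    description: str                              # free-text description
    legend: dict = field(default_factory=dict)    # code -> meaning, for coded elements
    cardinality: Optional[int] = None             # number of distinct coded values
    value_range: Optional[Tuple[float, float]] = None   # (min, max) for numerical elements
    category: str = "text"                        # special / coded / binary / numerical / text
```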
Matching: Metadata Based “Blocking”
• Elimination of candidates: eliminate candidates from the second source that are incompatible
• Incompatibility criteria (see sketch below):
  - Category mismatch
  - Cardinality mismatch
    (for coded elements, assume a normal distribution with SD of 1)
  - Range mismatch
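A minimal sketch of metadata-based blocking over pairs of `ElementRecord`s as defined above; the tolerance used for coded cardinalities is an illustrative reading of the "normal distribution with SD of 1" note, not GEM's exact rule.

```python
def compatible(src, tgt, sd=1.0):
    """Return False if a (source, target) element pair should be blocked before matching."""
    # Category mismatch: e.g. never compare a numerical element to a text element.
    if src.category != tgt.category:
        return False
    # Cardinality mismatch for coded elements, with a small tolerance
    # (illustrative reading of "assume normal distribution with SD of 1").
    if src.category in ("coded", "binary") and src.cardinality and tgt.cardinality:
        if abs(src.cardinality - tgt.cardinality) > 2 * sd:
            return False
    # Range mismatch for numerical elements: block non-overlapping ranges.
    if src.category == "numerical" and src.value_range and tgt.value_range:
        (lo_s, hi_s), (lo_t, hi_t) = src.value_range, tgt.value_range
        if hi_s < lo_t or hi_t < lo_s:
            return False
    return True

# candidates = [(s, t) for s in adni_elements for t in nacc_elements if compatible(s, t)]
```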
Matching Text Descriptions
• Employ a regular TF-IDF cosine distance on bag-of-words
• Or base matching on unsupervised topic modeling (LDA); see sketch below
  - Treat element descriptions as 'documents'
  - Build a topic model over these documents
  - Each element (description) has a probability distribution over topics
  - Element similarity (or distance) is based on the similarity (or lack thereof) of the associated topic distributions
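A minimal sketch of both description-matching options using scikit-learn. The toy descriptions, the topic count, and the use of cosine similarity on the topic distributions are assumptions for illustration; the parameters GEM actually uses are covered under system parameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

# Two toy element descriptions (illustrative, not from the actual data dictionaries).
descriptions = [
    "Participant gender",
    "Sex of the research subject",
]

# Option 1: TF-IDF cosine similarity on bag-of-words.
tfidf = TfidfVectorizer().fit_transform(descriptions)
tfidf_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Option 2: LDA topic modeling -- each description gets a topic distribution,
# and elements are compared via their distributions (cosine used here for simplicity).
counts = CountVectorizer().fit_transform(descriptions)
lda = LatentDirichletAllocation(n_components=5, random_state=0)  # topic count is an assumption
topic_dist = lda.fit_transform(counts)
lda_sim = cosine_similarity(topic_dist[:1], topic_dist[1:])[0, 0]

print(f"TF-IDF similarity: {tfidf_sim:.3f}, topic similarity: {lda_sim:.3f}")
```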
Element Name Matching
• Composite element names (segmentation sketch below):
  • PT GENDER
  • MOMDEM
  • PATGNDR
  • FHQDEMYR1
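The slides do not spell out how composite names are handled; the greedy lexicon-based segmentation below is only one illustrative way to break names like PTGNDR into meaningful tokens before name matching. The lexicon and its expansions are assumptions.

```python
# Illustrative greedy segmentation of composite element names against a small
# abbreviation lexicon (the lexicon and expansions are assumptions, not GEM's).
LEXICON = {
    "PT": "participant", "PAT": "participant", "MOM": "mother",
    "DEM": "dementia", "GNDR": "gender", "GENDER": "gender",
    "FHQ": "family history questionnaire", "YR": "year",
}

def segment(name):
    """Split a composite name like 'PTGNDR' into known tokens, longest match first."""
    name, tokens, i = name.upper(), [], 0
    while i < len(name):
        for j in range(len(name), i, -1):          # try the longest substring first
            if name[i:j] in LEXICON:
                tokens.append(LEXICON[name[i:j]])
                i = j
                break
        else:                                      # unknown character: keep it as-is
            tokens.append(name[i])
            i += 1
    return tokens

# segment("PTGNDR")    -> ['participant', 'gender']
# segment("MOMDEMYR1") -> ['mother', 'dementia', 'year', '1']
```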
Table Correspondence
• Elements generally do match across 'corresponding' tables
• Literal table names are not scalable as a feature
• Determine table correspondence heuristically, based on knowledge-driven match likelihood (sketch below):
$$ TCS(e_S, e_T) = \frac{\sum_{\text{all elements in } Tab(e_S)} M(e_S, e_T)}{\min\left(O(Tab(e_S)),\, O(Tab(e_T))\right)} $$
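A minimal sketch of the table correspondence score in code, assuming a pairwise match-likelihood function M(e_S, e_T) already exists. Aggregating each source element's best target match is one reasonable reading of the formula above, not necessarily GEM's exact definition.

```python
def table_correspondence(source_table, target_table, match_likelihood):
    """Table correspondence score (TCS) between two tables.

    For each element of the source table, take its best match likelihood against the
    target table's elements, sum these, and normalize by the size of the smaller table.
    match_likelihood(e_s, e_t) is assumed to return a similarity in [0, 1].
    """
    if not source_table or not target_table:
        return 0.0
    total = sum(
        max(match_likelihood(e_s, e_t) for e_t in target_table)
        for e_s in source_table
    )
    return total / min(len(source_table), len(target_table))
```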
Experimental Results
• Setup
  • Various data dictionaries: ADNI, NACC, DIAN, LAADC, INDD
  • Mapping pairs
    • Pairs of datasets: ADNI-NACC, ADNI-INDD, ADNI-LAADC, ...
    • Dataset to GAAIN Common Model (GCM): ADNI-GCM, NACC-GCM, ...
Experiments
• Mapping accuracy
• Effectiveness of individual components
  • Topic modeling (text description) match and filtering
• Comparison with related systems
• System parameters
Related Systems
1) Coma++
   http://dbs.unileipzig.de/Research/coma.html
   • More suited for 'semantic', ontology integration tasks
   • Based on XML (nested structure) similarity
   • No support for incorporating element descriptions
2) Harmony
   http://openii.sourceforge.net
   • Targets exactly the same mapping problem as ours
   • Utilizes element name similarity and also element descriptions in matching
Evaluated What
• Mappings taken pairwise
  • Dataset pairs: ADNI-NACC, ADNI-INDD and ADNI-LAADC
    • Gold sets: ~150 element pairs (created manually)
  • To the GAAIN Common Model: ADNI-GAAIN Common Model
    • 24 GAAIN Common Model elements
• Report
  • Accuracy in terms of F-measure (precision and recall)
  • Against N, the number of result alternatives per match
• Matching algorithms compared:
  (i) Harmony
  (ii) TF-IDF
  (iii) Topic modeling for text match
  (iv) Topic modeling + metadata filtering
Results: ADNI to NACC
Results: ADNI to LAADC
Results: ADNI to INDD
Results: ADNI to GAAIN Common Model
Training Topic Model Comparison
Common Model Mapping
Conclusions from Evaluation
• As a medical dataset mapping tool
  • High mapping accuracy (90% and above) is possible for datasets in this domain
  • Significantly higher mapping accuracy compared to available schema mapping systems like Coma++ and Harmony
• From a matching-approach perspective
  • No approach is universally superior for text similarity matching
    • Topic-modeling-based text matching provides significantly higher mapping accuracy than TF-IDF when descriptions are not exactly the same
    • TF-IDF outperforms topic modeling when descriptions are exactly the same
  • Metadata-based blocking is beneficial
• Internal system
  • Mapping accuracy is sensitive to topic model parameters
    • Hyperparameters in the underlying LDA topic model
  • Filter first, then match works better than match first, then eliminate
Data Understanding: Model Discovery Using GEM
• Identifying data elements for a common data model over a collection of multiple, disparate datasets
• Common data model design is a complex problem
• GEM helps significantly in the bottom-up design of a common data model
• For each column of a source, corresponding matches from all destination sources are given (see sketch below)
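A minimal sketch of this bottom-up usage: for each element of a source dataset, collect the top-scoring candidates from every other dataset. The `match_score` function and the top-k cutoff are assumptions for illustration.

```python
def model_discovery(source_elements, destination_datasets, match_score, top_k=3):
    """For each source element, list its top match candidates in every destination dataset.

    destination_datasets: dict of dataset name -> list of elements.
    match_score(e_s, e_t) -> similarity in [0, 1]; assumed to be provided by the matcher.
    """
    suggestions = []
    for e_s in source_elements:
        per_dataset = {
            name: sorted(elements, key=lambda e_t: match_score(e_s, e_t), reverse=True)[:top_k]
            for name, elements in destination_datasets.items()
        }
        suggestions.append((e_s, per_dataset))
    return suggestions
```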
Current Work
• Machine-learning classification (see sketch below)
  • Features: text similarity, name similarity, table correspondence, ...
• Active learning for training
• Data dictionary ingestion
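A minimal sketch of the machine-learning direction described above: a classifier over per-pair features such as text similarity, name similarity, and table correspondence. The feature values, labels, and choice of logistic regression are assumptions for illustration, not GEM's actual model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each candidate element pair is described by the features named on the slide.
# Values below are made up for illustration: [text_sim, name_sim, table_corr].
X_train = np.array([
    [0.92, 0.80, 0.75],   # known match
    [0.10, 0.05, 0.30],   # known non-match
    [0.85, 0.60, 0.70],   # known match
    [0.20, 0.15, 0.25],   # known non-match
])
y_train = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X_train, y_train)

# Score a new candidate pair; the probability can rank match alternatives.
candidate = np.array([[0.78, 0.55, 0.68]])
match_probability = clf.predict_proba(candidate)[0, 1]
```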
Links
1) http://www.gaain.org
2) http://www-hsc.usc.edu/~ashish/ADT.htm
Thank you!
nashish@loni.usc.edu