SemGraphs - UNED NLP Group, Madrid

The Use of Semantic
Graphs for Modeling
Biomedical Text
Laura Plaza
NIL- Natural Interaction based on Language
Universidad Complutense de Madrid
Semantic graph-based representation, applied to:
• Text summarization
• Automatic indexing
• Information retrieval
Why semantic?
Synonymy
Cerebrovascular diseases during pregnancy may result from hemorrhage
=
Brain vascular disorders during gestation may result from hemorrhage
Polysemy
The common cold is more common in cold weather than in summer
Why graphs?
Pneumonia
Pneumococcal infection is a lung infection caused by Streptococcus pneumoniae.
Mycoplasma pneumonia is another type of atypical pneumonia.
The patient reported feeling short of breath (symptom) and was diagnosed with pneumonia.
Pneumococcal pneumonia co-occurs with influenza.
Our Proposal

Using concepts and relations from external knowledge sources to represent the text as a graph

Exploiting the topology of the network to identify groups of semantically related concepts that represent different topics
Representation Process
Document pre-processing
Concept identification
Document representation
Concept clustering and topic recognition
Document preprocessing
Concept Identification
The goal of the trial was to assess cardiovascular
mortality for stroke
Concepts
Goals (Intellectual Product)
Clinical Trials (Research Activity)
Cardiovascular system (Body System)
Mortality vital statistics (Quantitative Concept)
Cerebrovascular accident (Disease or Syndrome)
Concept Identification - Ambiguity
Tissues are often cold
Phrase: “Tissues”
Meta Mapping (1000)
  1000 C0040300:Tissues (Body tissue)
Phrase: “are”
Phrase: “often cold”
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0234192:Cold (Cold Sensation)
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0009443:Cold (Common Cold)
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0009264:Cold (Cold temperature)
Word sense disambiguation (WSD)
• Personalized PageRank (PPR)
• Journal Descriptor Indexing (JDI)
• Machine Readable Dictionary (MRD)
• Automatic Extracted Corpus (AEC)
Document Representation
The goal of the trial was to assess cardiovascular mortality and morbidity for
stroke, coronary heart disease and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension.
[Figure: sentence graph for the sentence above. Concepts such as Clinical Trials, Clinicians, Hypertensive Disease, Cerebrovascular Accident, Coronary Heart Disease and Congestive Heart Failure are linked to their ancestors in the hierarchy (e.g., Research Activity, Professional Personnel, Blood Pressure Finding, Heart Disorder, Non-Neoplastic Vascular Disorder, Disease or Disorder).]
Document Representation

All the sentence graphs are merged into a
single Document Graph

The graph is extended with more semantic
relations

Each edge is assigned a weight in [0, 1]

Different relations may be assigned different
weights

The more specific the concepts connected by an edge, the greater the
weight assigned to that edge
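
The merging and weighting steps above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names and the toy hypernym paths are hypothetical, and the weighting rule (depth of the parent divided by depth of the child) is an assumption chosen only because it yields weights in [0, 1] that grow with specificity, like the 1/2, 2/3, 3/4 labels visible in the figures.

```python
from fractions import Fraction

def sentence_graph(hypernym_paths):
    """Build {(parent, child): weight} from the is-a paths of one sentence.

    Each path runs from the most general concept to the concept found
    in the text.
    """
    edges = {}
    for path in hypernym_paths:
        for depth, (parent, child) in enumerate(zip(path, path[1:]), start=1):
            # Assumed rule: depth(parent) / depth(child).
            # Edges between more specific (deeper) concepts get closer to 1.
            edges[(parent, child)] = Fraction(depth, depth + 1)
    return edges

def document_graph(sentence_graphs):
    """Merge sentence graphs; edges shared by several sentences collapse."""
    merged = {}
    for graph in sentence_graphs:
        merged.update(graph)
    return merged

# Toy paths (hypothetical, much shorter than real UMLS hierarchies).
s1 = sentence_graph([["Disorder", "Heart Disorder", "Congestive Heart Failure"]])
s2 = sentence_graph([["Disorder", "Heart Disorder", "Coronary Heart Disease"]])
doc = document_graph([s1, s2])

for (parent, child), weight in doc.items():
    print(f"{parent} -> {child}: {weight}")
```

Other relation types (for example, associated-with relations) could be added to the same dictionary with their own weights.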
The goal of the trial was to assess cardiovascular mortality and morbidity for stroke, coronary heart disease
and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension.
While event rates for fatal cardiovascular disease were similar, there was a disturbing tendency for stroke to
occur more often in the doxazosin group than in the group taking chlorthalidone.
[Figure: document graph for the two sentences above. Concepts such as Doxazosin, Chlorthalidone, Thiazide Diuretics, Cerebrovascular Accident, Coronary Heart Disease, Congestive Heart Failure, Hypertensive Disease, Clinicians and Clinical Trials are linked to their ancestors; edges carry weights such as 1/2, 2/3, 3/4 and 1. Legend: is-a relations, associated-with relations, other related relations.]
Concept Clustering & Topic
Recognition
[Figure: document graph with its hub vertices highlighted]
Concept Clustering & Topic
Recognition
\[ \mathrm{salience}(v_i) = \sum_{e_j \,\mid\, \exists v_k \,\wedge\, e_j \ \mathrm{connects}\ (v_i, v_k)} \mathrm{weight}(e_j) \]
• Concepts are ranked by salience
• The n vertices with the highest salience are
called hub vertices
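
A minimal sketch of the salience ranking, assuming the graph is stored as a dictionary from vertex pairs to edge weights. The vertex names, weights, and function names are hypothetical; salience is the sum of the weights of the edges incident to a vertex, as in the definition above.

```python
from collections import defaultdict

def salience(edges):
    """Sum, for every vertex, the weights of the edges that touch it."""
    scores = defaultdict(float)
    for (u, v), weight in edges.items():
        scores[u] += weight
        scores[v] += weight
    return dict(scores)

def hub_vertices(edges, n):
    """Return the n vertices with the highest salience (ties broken by name)."""
    scores = salience(edges)
    return sorted(scores, key=lambda v: (-scores[v], v))[:n]

# Toy graph with made-up weights.
edges = {("A", "B"): 0.5, ("B", "C"): 0.75, ("B", "D"): 1.0, ("D", "E"): 0.5}
scores = salience(edges)
hubs = hub_vertices(edges, n=2)
print(scores)
print(hubs)
```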

Concept Clustering & Topic
Recognition
• The hub vertices are grouped into Hub Vertex Sets (HVSs)
• Each remaining vertex is assigned to the cluster to which it is most strongly connected
• The number and properties of the clusters depend strongly on the parameter values
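
The two clustering steps can be sketched as below. The slides do not state how hubs are grouped into HVSs, so this sketch assumes that hubs joined by a direct edge fall into the same HVS; the connectivity measure (sum of the weights of the edges linking a vertex to an HVS) and all names and data are likewise assumptions for illustration.

```python
def hub_vertex_sets(hubs, edges):
    """Group hubs into HVSs. Assumed rule: directly linked hubs are merged."""
    sets = [{h} for h in hubs]
    for (u, v) in edges:
        if u in hubs and v in hubs:
            su = next(s for s in sets if u in s)
            sv = next(s for s in sets if v in s)
            if su is not sv:
                su |= sv
                sets.remove(sv)
    return sets

def link_strength(vertex, hvs, edges):
    """Total weight of the edges between a vertex and the hubs of one HVS."""
    return sum(w for (a, b), w in edges.items()
               if (a == vertex and b in hvs) or (b == vertex and a in hvs))

def assign_to_clusters(hvs_list, edges):
    """Attach every non-hub vertex to the HVS it is most connected to."""
    clusters = [set(hvs) for hvs in hvs_list]
    vertices = sorted({v for edge in edges for v in edge})
    for v in vertices:
        if any(v in hvs for hvs in hvs_list):
            continue
        # A vertex with no link to any hub falls into the first cluster.
        best = max(range(len(hvs_list)),
                   key=lambda i: link_strength(v, hvs_list[i], edges))
        clusters[best].add(v)
    return clusters

# Toy graph with made-up weights.
edges = {("h1", "h2"): 1.0, ("h1", "a"): 0.5, ("h3", "b"): 0.75,
         ("h2", "b"): 0.25, ("h3", "c"): 0.5}
hvs_list = hub_vertex_sets(["h1", "h2", "h3"], edges)
clusters = assign_to_clusters(hvs_list, edges)
print(clusters)
```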

Concept Clustering & Topic
Recognition
[Figure: example clusters. The concepts shown include Adverse reactions, Congestive heart failure, Amlodipine, Chlorthalidone, Drug pseudoallergen by function, Blood pressure finding, Cerebrovascular accident and Hepatic ... in one group, and Health personnel, Elderly, Organism, Population group, Persons, Clinicians and Patients in another.]
Text Summarization
Creating a condensed version of one or more documents
Motivation
• Summaries as an indication of what a document is about
• Improving indexing, categorization, and IR
Types
• Extracts vs. abstracts
• Single vs. multi-document
• Generic vs. application-oriented
Text Summarization
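[Figure: each sentence (Sentence 1 ... Sentence n) is compared with each cluster (Cluster 1 ... Cluster m), yielding a score such as Similarity = 35.0]
\[ \mathrm{similarity}(C_i, S_j) = \sum_{v_k \,\mid\, v_k \in S_j} w_{k,i,j} \]
\[ w_{k,i,j} = \begin{cases} 0 & \text{if } v_k \notin C_i \\ 1.0 & \text{if } v_k \in HVS(C_i) \\ 0.5 & \text{if } v_k \in C_i,\ v_k \notin HVS(C_i) \end{cases} \]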
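
The sentence-to-cluster similarity can be sketched directly from these weights: a concept of the sentence scores 1.0 if it is a hub of the cluster, 0.5 if it is a non-hub member, and 0 otherwise. The function name and the toy cluster below are hypothetical.

```python
def similarity(sentence_concepts, cluster, hvs):
    """Score one sentence against one cluster.

    sentence_concepts: concepts found in the sentence
    cluster: all concepts of the cluster (hubs included)
    hvs: the hub vertex set of that cluster
    """
    score = 0.0
    for concept in sentence_concepts:
        if concept in hvs:
            score += 1.0      # hub vertex of the cluster
        elif concept in cluster:
            score += 0.5      # ordinary member of the cluster
        # concepts outside the cluster contribute 0
    return score

# Toy cluster (made-up membership).
cluster = {"stroke", "hypertension", "doxazosin", "heart failure"}
hvs = {"stroke", "hypertension"}
score = similarity(["stroke", "doxazosin", "clinician"], cluster, hvs)
print(score)
```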
Text Summarization
[Figure: for each cluster (Cluster 1 ... Cluster n), the sentences ranked by their similarity to that cluster]
Sentence selection
• H.1: Selecting the top n ranked sentences from the biggest cluster
• H.2: Selecting n_i sentences from each cluster
• H.3: Weighting the sentence-to-cluster similarity by the clusters' sizes
+ other traditional criteria: frequency, position,
similarity with the title, etc.
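
The three heuristics can be sketched as below, given a table of sentence-to-cluster similarities. The slides do not give the formulas, so two points are assumptions: in H.2 the per-cluster quotas n_i are passed in by the caller, and in H.3 each similarity is multiplied by the cluster's share of all concepts. All names and numbers are hypothetical.

```python
def rank(sim, score):
    """Sentences sorted by decreasing score (ties broken by name)."""
    return sorted(sim, key=lambda s: (-score(s), s))

def h1(sim, sizes, n):
    """H.1: top n sentences for the biggest cluster."""
    biggest = max(range(len(sizes)), key=sizes.__getitem__)
    return rank(sim, lambda s: sim[s][biggest])[:n]

def h2(sim, quotas):
    """H.2: quotas[i] sentences from cluster i, skipping ones already chosen."""
    chosen = []
    for i, quota in enumerate(quotas):
        ranked = [s for s in rank(sim, lambda s: sim[s][i]) if s not in chosen]
        chosen.extend(ranked[:quota])
    return chosen

def h3(sim, sizes, n):
    """H.3: one score per sentence, similarities weighted by cluster size.

    Assumed weighting: size of the cluster divided by the total size.
    """
    total = sum(sizes)
    return rank(sim, lambda s: sum(x * size / total
                                   for x, size in zip(sim[s], sizes)))[:n]

# sim[sentence] = similarity to cluster 0 and cluster 1 (made-up values).
sim = {"s1": [3.0, 0.5], "s2": [1.0, 2.0], "s3": [2.0, 0.0]}
sizes = [6, 2]

print(h1(sim, sizes, n=2))
print(h2(sim, quotas=[1, 1]))
print(h3(sim, sizes, n=2))
```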
Text Summarization

Evaluation: How well is the important content preserved in the summary?
• ROUGE automatic evaluation metrics
• Comparison with the abstracts of the articles

Method           ROUGE-2   ROUGE-SU4
H.3*             0.3538    0.3267
H.2*             0.3421    0.3205
H.1*             0.3453    0.3189
LexRank          0.3248    0.3097
SUMMA            0.3187    0.2989
AutoSummarize    0.2446    0.2318
Text Summarization

Evaluation: How does ambiguity affect summarization?

WSD method       ROUGE-2   ROUGE-SU4
AEC              0.3670    0.3379
MRD              0.3611    0.3341
JDI              0.3538    0.3267
First mapping    0.3283    0.3117
Summarization of Biological Entity-related Information
Given a list of genes (or proteins):
1. Retrieving documents related to the genes
2. Building a semantic graph-based representation of the corpus
3. Identifying groups of genes/proteins
4. Generating a summary for each group that describes the functionality of the entities
Multi-document, application-oriented summarization
Automatic Indexing of Biomedical
Literature using Summaries
[Figure: Title + Abstract / Full text → MTI → Ordered list of MeSH main headings → Refined list of MeSH headings]
Automatic Indexing of Biomedical
Literature using Summaries
What about using the full texts?
◦ Recall increases but precision decreases
What about using automatic summaries of different lengths?
◦ As the length increases, recall improves but precision worsens
◦ There is a summary length that maximizes F-measure
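
The trade-off above can be made concrete with a small sketch that computes precision, recall and F-measure of the suggested headings for several summary lengths and picks the best length. The headings and lengths below are invented purely for illustration; they are not results from the study.

```python
def precision_recall_f(predicted, gold):
    """Precision, recall and F-measure of predicted headings against gold ones."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical gold headings and indexer output per summary length
# (length as a percentage of the full text).
gold = {"Hypertension", "Stroke", "Doxazosin", "Chlorthalidone"}
headings_by_length = {
    15: {"Hypertension", "Stroke"},
    30: {"Hypertension", "Stroke", "Doxazosin", "Humans"},
    100: {"Hypertension", "Stroke", "Doxazosin", "Chlorthalidone",
          "Humans", "Adult", "Risk", "Aged"},
}

results = {length: precision_recall_f(headings, gold)
           for length, headings in headings_by_length.items()}
best_length = max(results, key=lambda length: results[length][2])

for length, (p, r, f) in results.items():
    print(f"{length:>3}%  P={p:.2f}  R={r:.2f}  F={f:.2f}")
print("best length:", best_length)
```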
Retrieval of Similar Patient Cases
Motivation:
Facilitating access to previous cases
Problem:
Given a reference patient record, retrieve the other records in the
clinical database that are similar to it
Retrieval of Similar Patient Cases
When can we consider that two patient records are similar?
• Same symptom or sign (e.g., fever)
• Same diagnosis (e.g., bacterial pneumonia)
• Same test or procedure (e.g., endoscopic biopsy)
• Same medication (e.g., clopidogrel)
• But ... absent criteria are not relevant!
Retrieval of Similar Patient Cases



• The records are represented using UMLS graphs
• Concepts are filtered by semantic type
• Negated concepts are ignored

Category              UMLS Semantic Types
Symptoms and Signs    Sign or Symptom; Finding
Diseases              Disease or Syndrome; Pathologic Function
Procedures            Therapeutic or Preventive Procedure; Diagnostic Procedure
Body Parts            Body Location or Region; Body Part, Organ, or Organ Component
Medicaments           Pharmacologic Substance
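
The filtering step can be sketched as a lookup from semantic type to category, using the table above. The record format (name, semantic type, negation flag) and the function name are assumptions made for this example.

```python
# Semantic types kept for each category, taken from the table above.
CATEGORY_BY_SEMANTIC_TYPE = {
    "Sign or Symptom": "Symptoms and Signs",
    "Finding": "Symptoms and Signs",
    "Disease or Syndrome": "Diseases",
    "Pathologic Function": "Diseases",
    "Therapeutic or Preventive Procedure": "Procedures",
    "Diagnostic Procedure": "Procedures",
    "Body Location or Region": "Body Parts",
    "Body Part, Organ, or Organ Component": "Body Parts",
    "Pharmacologic Substance": "Medicaments",
}

def filter_concepts(concepts):
    """Keep non-negated concepts whose semantic type is in the table.

    concepts: iterable of (name, semantic_type, is_negated) tuples
    (a hypothetical record format).
    """
    return [(name, CATEGORY_BY_SEMANTIC_TYPE[semantic_type])
            for name, semantic_type, is_negated in concepts
            if not is_negated and semantic_type in CATEGORY_BY_SEMANTIC_TYPE]

record = [
    ("Fever", "Sign or Symptom", False),
    ("Bacterial pneumonia", "Disease or Syndrome", False),
    ("Clopidogrel", "Pharmacologic Substance", True),     # negated: dropped
    ("Patients", "Patient or Disabled Group", False),     # type not kept
]
kept = filter_concepts(record)
print(kept)
```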
Retrieval of Similar Patient Cases

We compute the similarity between the reference
record and every record in the database
[Figure: Graph A and Graph B, two UMLS concept graphs that share ancestors such as Clinical finding, Finding by site and Disorder by body site. Vertices are labelled with depth-based weights (1/11, 2/11, ..., 11/11 along the branch through Bacterial pneumonia to Pneumococcal pneumonia; 3/5, 4/5, 5/5 along the branch through Respiratory finding to Coughing). Other vertices shown include Infectious disease, Virus Diseases, Pneumonia due to Streptococcus, Pneumonia due to anaerobic bacteria, Pneumonia due to pleuropneumonia ... and Mycoplasma pneumonia.]
\[ \mathrm{Similarity} = \frac{\mathrm{Votes}}{\mathrm{MaxSimilarity}} = 0.4869 \]
Automatic Indexing of EHR

Discovering relevant SNOMED-CT
concepts in health records
4 steps
1. Spell checking
2. Acronym expansion and WSD
3. Negation detection
4. Concept identification
Automatic Indexing of EHR
1. Spell checking
◦ Hunspell + Levenshtein + keyboard + phonetic distance
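
As an illustration of the Levenshtein component only, the sketch below ranks correction candidates by edit distance. The candidate list is hypothetical and stands in for suggestions that a dictionary such as Hunspell would supply; the keyboard and phonetic distances are not modelled here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]

def rank_candidates(word, candidates):
    """Order candidates by edit distance to the misspelled word."""
    return sorted(candidates, key=lambda c: (levenshtein(word, c), c))

# Hypothetical misspelling and candidate corrections (accents omitted).
ranked = rank_candidates("neumnia", ["anemia", "neumonia", "neumatico"])
print(ranked)
```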
Automatic Indexing of EHR
2. Acronym expansion and WSD
◦ A list of abbreviations + machine learning + expert rules
Automatic Indexing of EHR
3. Negation detection
◦ Spanish adaptation of the NegEx algorithm
◦ Negation cue + negation scope
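
The cue-plus-scope idea can be sketched as below. This is a heavily simplified illustration, not NegEx itself nor the Spanish adaptation used by the author: the cue list, the scope terminators and the five-token window are all assumptions.

```python
import re

# Hypothetical, very small Spanish lists; a real system uses far larger ones.
NEGATION_CUES = {"no", "sin", "niega", "descarta"}
SCOPE_TERMINATORS = {"pero", "aunque", ".", ",", ";"}

def negated_tokens(sentence, window=5):
    """Return the tokens that fall inside the scope of a negation cue.

    The scope starts after the cue and ends at a terminator or after
    `window` tokens, whichever comes first.
    """
    tokens = re.findall(r"\w+|[.,;]", sentence.lower())
    negated = set()
    for i, token in enumerate(tokens):
        if token not in NEGATION_CUES:
            continue
        for following in tokens[i + 1:i + 1 + window]:
            if following in SCOPE_TERMINATORS:
                break
            negated.add(following)
    return negated

# "The patient does not present fever but reports cough."
scope = negated_tokens("El paciente no presenta fiebre pero refiere tos.")
print(scope)
```

Concepts whose tokens fall inside the returned scope would then be ignored in the later steps.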
Automatic Indexing of EHR
4. Concept identification
Query: El recién nacido fue ingresado ("The newborn was admitted")
SNOMED-CT concept descriptions → candidate mappings:
- Recién nacido (newborn)
- Recién nacido prematuro (premature newborn)
- Ingreso del paciente (patient admission)
Scoring function → final mappings:
- Recién nacido
- Ingreso del paciente
Automatic Indexing of EHR
Future work

◦ Representing the EHR as a graph using different relations from SNOMED-CT
◦ Computing the salience of the concepts to obtain the most representative ones
◦ Using such a representation in different NLP tasks (e.g., categorization, IR)
Further Readings
Summarization
Plaza, L., Díaz, A., Gervás, P. (2011). A semantic graph-based approach to biomedical
summarization. Artificial Intelligence in Medicine, 53.
Plaza, L. (2012). Evaluating the importance of sentence position for automatic
summarization of biomedical literature. Submitted to Bioinformatics.
Word Sense Disambiguation
Plaza, L., Stevenson, M., Díaz, A. (2012). Resolving Ambiguity in Biomedical Text to
Improve Summarization. Information Processing & Management, 48(4).
Plaza, L., Jimeno-Yepes, A., Díaz, A., Aronson, A. (2011). Studying correlation between
different word sense disambiguation methods and summarization effectiveness
in biomedical texts. BMC Bioinformatics, 12.
Automatic Indexing
Jimeno-Yepes, A., Plaza, L., Mork, J., Díaz, A., Aronson, A. (2012). Using automatic
summaries to improve automatic indexing. To appear in BMC Bioinformatics.
Retrieval of Similar Cases
Plaza, L., Díaz, A. (2010). Retrieval of Similar Electronic Health Records using UMLS
Concept Graphs. 15th International Conf. on Applications of Natural Language to
Information Systems.