The Use of Semantic Graphs for Modeling Biomedical Text

Laura Plaza
NIL - Natural Interaction based on Language
Universidad Complutense de Madrid

Semantic graph-based representation
• Text summarization
• Automatic indexing
• Information retrieval

Why semantic?
• Synonymy
  ◦ Cerebrovascular diseases during pregnancy may result from hemorrhage
  ◦ = Brain vascular disorders during gestation may result from hemorrhage
• Polysemy
  ◦ The common cold is more common in cold weather than in summer

Why graphs?
Pneumococcal infection is a lung infection caused by Streptococcus pneumoniae. Mycoplasma pneumonia is another type of atypical pneumonia. The patient reported feeling short of breath and was diagnosed with pneumonia. Pneumococcal pneumonia co-occurs with influenza.
[Figure: graph around the concept "Pneumonia"; visible relation labels: "Symptom", "Co-occurs with".]

Our Proposal
• Using concepts and relations from external knowledge sources to represent the text as a graph
• Exploiting the topology of the network to identify groups of semantically related concepts that represent different topics

Representation Process
1. Document pre-processing
2. Concept identification
3. Document representation
4. Concept clustering and topic recognition

Concept Identification
The goal of the trial was to assess cardiovascular mortality for stroke

Concepts
• Goals (Intellectual Product)
• Clinical Trials (Research Activity)
• Cardiovascular system (Body System)
• Mortality vital statistics (Quantitative Concept)
• Cerebrovascular accident (Disease or Syndrome)

Concept Identification - Ambiguity
Tissues are often cold

Phrase: “Tissues”
Meta Mapping (1000)
  1000 C0040300:Tissues (Body tissue)
Phrase: “are”
Phrase: “often cold”
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0234192:Cold (Cold Sensation)
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0009443:Cold (Common Cold)
Meta Mapping (888)
  694 C0332183:Often (Frequent)
  861 C0009264:Cold (Cold temperature)

WSD
• Personalized PageRank (PPR)
• Journal Descriptor Indexing (JDI)
• Machine Readable Dictionary (MRD)
• Automatically Extracted Corpus (AEC)

Document Representation
The goal of the trial was to assess cardiovascular mortality and morbidity for stroke, coronary heart disease and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension.
[Figure: sentence graph. The concepts identified in the sentence (Clinical Trials, Clinicians, Cardiovascular System, Hypertensive Disease, Cerebrovascular Accident, Coronary Heart Disease, Congestive Heart Failure) are linked through their hypernyms (e.g., Study, Clinical Study, Heart Disorder, Non-Neoplastic Vascular Disorder, Cerebrovascular Disorder) up to general categories such as Activity and Personnel.]

• All the sentence graphs are merged into a single Document Graph
• The graph is extended with more semantic relations
• Each edge is assigned a weight in [0, 1]
  ◦ Different relations may be assigned different weights
  ◦ The more specific the concepts are, the more weight is assigned to the edge

The goal of the trial was to assess cardiovascular mortality and morbidity for stroke, coronary heart disease and congestive heart failure, as an evidence-based guide for clinicians who treat hypertension.
While event rates for fatal cardiovascular disease were similar, there was a disturbing tendency for stroke to occur more often in the doxazosin group than in the group taking chlorthalidone.
[Figure: document graph merging the sentence graphs. Besides the concepts of the first sentence, it includes Doxazosin, Chlorthalidone, Thiazide Diuretics, Diuretic, Alpha-Adrenergic Blocking Agent, Cardiovascular Drug, and Pharmaceutical Adjuvant. Edge weights shown: 1/2, 2/3, 3/4, 1. Legend: "is a" relations, "other related" relations, "associated with" relations.]

Concept Clustering & Topic Recognition
[Figure: document graph with the hubs marked.]

$$\mathrm{salience}(v_i) = \sum_{e_j \,\mid\, \exists v_k \,\wedge\, e_j \text{ connects } (v_i, v_k)} \mathrm{weight}(e_j)$$

• Concepts are ranked by salience
• The n vertices with the highest salience are called hub vertices
• The hub vertices are grouped into Hub Vertex Sets (HVSs)
• The remaining vertices are assigned to the cluster to which they are most connected
• The number and properties of the clusters depend strongly on the parameter values

[Figure: example clusters. One groups concepts such as Adverse reactions, Congestive heart failure, Amlodipine, Chlorthalidone, Drug pseudoallergen by function, Blood pressure finding, Cerebrovascular accident, Hepatic ...; another groups Health personnel, Elderly, Organism, Population group, Persons, Clinicians, Patients.]

Text Summarization
Creating a condensed version of one or several documents
• Motivation
  ◦ Summaries as an indication of what a document is about
  ◦ Improving indexing, categorization, and IR
• Types
  ◦ Extracts vs. abstracts
  ◦ Single vs. multi-document
  ◦ Generic vs. application-oriented

[Figure: each sentence (Sentence 1 ... Sentence n) is compared with each cluster (Cluster 1 ... Cluster m); example label: Similarity = 35.0.]

$$\mathrm{similarity}(C_i, S_j) = \sum_{v_k \,\mid\, v_k \in S_j} w_{k,i,j}, \qquad w_{k,i,j} = \begin{cases} 0 & \text{if } v_k \notin C_i \\ 1.0 & \text{if } v_k \in \mathrm{HVS}(C_i) \\ 0.5 & \text{if } v_k \in C_i \text{ and } v_k \notin \mathrm{HVS}(C_i) \end{cases}$$

[Figure: for each cluster (Cluster 1 ... Cluster n), the sentences are ranked by their similarity to that cluster.]

Sentence selection
• H.1: Selecting the top n ranked sentences from the biggest cluster
• H.2: Selecting n_i sentences from each cluster
• H.3: Weighting the sentence-to-cluster similarity by the clusters' sizes
+ other traditional criteria: frequency, position, similarity with the title, etc.

Evaluation: How well is the important content preserved in the summary?
• ROUGE automatic evaluation metrics
• Comparison with the abstracts of the articles

System          ROUGE-2   ROUGE-SU4
H.3*            0.3538    0.3267
H.2*            0.3421    0.3205
H.1*            0.3453    0.3189
LexRank         0.3248    0.3097
SUMMA           0.3187    0.2989
AutoSummarize   0.2446    0.2318

Evaluation: How does ambiguity affect summarization?

WSD method      ROUGE-2   ROUGE-SU4
AEC             0.3670    0.3379
MRD             0.3611    0.3341
JDI             0.3538    0.3267
First mapping   0.3283    0.3117

Summarization of Biological Entity-related Information
Given a list of genes (or proteins):
1. Retrieving documents related to the genes
2. Building a semantic graph-based representation of the corpus
3. Identifying groups of genes/proteins
4. Generating a summary for each group that describes the functionality of the entities
Multi-document, application-oriented summarization

Automatic Indexing of Biomedical Literature using Summaries
[Figure: Title + Abstract / Full text -> MTI -> Ordered list of MeSH main headings -> Refined list of MeSH headings]
• What about using the full texts?
  ◦ Recall increases but precision decreases
• What about using automatic summaries of different lengths?
  ◦ As the length increases, recall improves but precision worsens
  ◦ There is a summary length that maximizes F-measure

Retrieval of Similar Patient Cases
• Motivation: Facilitating access to previous cases
• Problem: Given a reference patient record, retrieve other records from the clinical database that are similar to it

When can we consider two patient records similar?
• Same symptom or sign (e.g., fever)
• Same diagnosis (e.g., bacterial pneumonia)
• Same test or procedure (e.g., endoscopic biopsy)
• Same medication (e.g., clopidogrel)
But absent criteria are not relevant!

• The records are represented using UMLS graphs
• Concepts are filtered by semantic type
• Negated concepts are ignored

Category             UMLS Semantic Types
Symptoms and Signs   Sign or Symptom; Finding
Diseases             Disease or Syndrome; Pathologic Function
Procedures           Therapeutic or Preventive Procedure; Diagnostic Procedure
Body Parts           Body Location or Region; Body Part, Organ, or Organ Component
Medicaments          Pharmacologic Substance

We compute the similarity between the reference record and every record in the database.
[Figure: concept hierarchies of Graph A and Graph B, descending from "Clinical finding" to specific concepts such as "Bacterial pneumonia", "Pneumococcal pneumonia", "Mycoplasma pneumonia", and "Coughing". Each concept is labeled with its relative depth in its hierarchy (1/11 ... 11/11 in one branch, 3/5 ... 5/5 in another). Labels in the figure: Similarity, Votes, MaxSimilarity; reported value: 0.4869.]

Automatic Indexing of EHR
Discovering relevant SNOMED-CT concepts in health records
4 steps
1. Spell checking
  ◦ Hunspell + Levenshtein + keyboard + phonetic distance
2. Acronym expansion and WSD
  ◦ A list of abbreviations + machine learning + expert rules
3. Negation detection
  ◦ NegEx algorithm, Spanish adaptation
  ◦ Negation cue + negation scope
4. Concept identification
  ◦ Query: "El recién nacido fue ingresado" (The newborn was admitted)
  ◦ The query is matched against the SNOMED-CT concept descriptions
  ◦ Candidate mappings: "Recién nacido" (Newborn); "Recién nacido prematuro" (Premature newborn); "Ingreso del paciente" (Patient admission)
  ◦ Scoring function
  ◦ Final mappings: "Recién nacido"; "Ingreso del paciente"

Future work
◦ Representing the EHR as a graph using different relations from SNOMED-CT
◦ Computing the salience of the concepts to obtain the most representative ones
◦ Using this representation in different NLP tasks (e.g., categorization, IR, etc.)

Further Readings
Summarization
• Plaza, L., Díaz, A., Gervás, P. (2011). A semantic graph-based approach to biomedical summarization. Artificial Intelligence in Medicine, 53.
• Plaza, L. (2012). Evaluating the importance of sentence position for automatic summarization of biomedical literature. Submitted to Bioinformatics.
Word Sense Disambiguation
• Plaza, L., Stevenson, M., Díaz, A. (2012). Resolving Ambiguity in Biomedical Text to Improve Summarization. Information Processing & Management, 48(4).
• Plaza, L., Jimeno-Yepes, A., Díaz, A., Aronson, A. (2011). Studying correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts. BMC Bioinformatics, 12.
Automatic Indexing
• Jimeno-Yepes, A., Plaza, L., Mork, J., Díaz, A., Aronson, A. (2012). Using automatic summaries to improve automatic indexing. To appear in BMC Bioinformatics.
Retrieval of Similar Cases
• Plaza, L., Díaz, A. (2010). Retrieval of Similar Electronic Health Records using UMLS Concept Graphs. 15th International Conf. on Applications of Natural Language to Information Systems.
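
Sketch: salience and hub vertices
The Concept Clustering & Topic Recognition slides define the salience of a vertex as the sum of the weights of the edges connected to it, and call the n most salient vertices hub vertices. A minimal sketch of that ranking follows. The edge list, its weights, and the function names are illustrative assumptions, not data or code from the talk.

```python
from collections import defaultdict

# Hypothetical weighted, undirected edges: (concept, concept, weight in [0, 1]).
EDGES = [
    ("Cardiovascular Diseases", "Heart Disorder", 0.5),
    ("Heart Disorder", "Congestive Heart Failure", 0.75),
    ("Heart Disorder", "Coronary Heart Disease", 0.75),
    ("Cardiovascular Diseases", "Hypertensive Disease", 0.5),
]


def salience(edges):
    """Return, for each vertex, the sum of the weights of its incident edges."""
    scores = defaultdict(float)
    for u, v, weight in edges:
        scores[u] += weight
        scores[v] += weight
    return dict(scores)


def hub_vertices(edges, n):
    """Return the n vertices with the highest salience (ties broken by name)."""
    scores = salience(edges)
    ranked = sorted(scores, key=lambda vertex: (-scores[vertex], vertex))
    return ranked[:n]


if __name__ == "__main__":
    for vertex in hub_vertices(EDGES, 2):
        print(vertex, salience(EDGES)[vertex])
```

The value of n is one of the parameters that, as the slides note, strongly affect the resulting clusters; the tie-breaking rule here is an arbitrary choice made only to keep the output deterministic.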
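
Sketch: sentence-to-cluster similarity
The Text Summarization slides score a sentence against a cluster by adding 1.0 for each sentence concept in the cluster's Hub Vertex Set, 0.5 for each concept elsewhere in the cluster, and 0 otherwise. The sketch below applies that weighting and then ranks sentences against a single cluster. The concept sets and function names are made up for illustration, and the HVS is assumed to be a subset of its cluster.

```python
def sentence_cluster_similarity(sentence_concepts, cluster, hvs):
    """Score one sentence against one cluster.

    sentence_concepts: set of concepts found in the sentence
    cluster: set of all concepts assigned to the cluster
    hvs: set of hub vertices of that cluster (assumed subset of `cluster`)
    """
    score = 0.0
    for concept in sentence_concepts:
        if concept in hvs:
            score += 1.0   # concept is a hub vertex of the cluster
        elif concept in cluster:
            score += 0.5   # concept is in the cluster but is not a hub
        # concepts outside the cluster contribute 0
    return score


def rank_sentences(sentences, cluster, hvs):
    """Return sentence ids sorted by decreasing similarity to the cluster."""
    scored = {
        sid: sentence_cluster_similarity(concepts, cluster, hvs)
        for sid, concepts in sentences.items()
    }
    return sorted(scored, key=lambda sid: (-scored[sid], sid))


# Hypothetical cluster and sentences.
CLUSTER = {"Cerebrovascular Accident", "Congestive Heart Failure",
           "Chlorthalidone", "Blood Pressure Finding"}
HVS = {"Cerebrovascular Accident", "Chlorthalidone"}
SENTENCES = {
    1: {"Cerebrovascular Accident", "Chlorthalidone", "Doxazosin"},
    2: {"Congestive Heart Failure", "Clinicians"},
    3: {"Clinicians", "Patients"},
}

if __name__ == "__main__":
    print(rank_sentences(SENTENCES, CLUSTER, HVS))
```

Taking the first n ids of this ranking for the biggest cluster corresponds to heuristic H.1; H.2 and H.3 would repeat the ranking per cluster and are not shown.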
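
Sketch: Levenshtein distance for spell checking
The spell-checking step of Automatic Indexing of EHR combines Hunspell with Levenshtein, keyboard, and phonetic distances. The slides do not say how these signals are combined, so the sketch below covers only the Levenshtein part: it computes the edit distance and picks the closest entry from a candidate list. The candidate list and the helper `closest_candidate` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_candidate(word, candidates):
    """Return the candidate with the smallest edit distance to `word`."""
    return min(candidates, key=lambda candidate: (levenshtein(word, candidate), candidate))


if __name__ == "__main__":
    # Hypothetical dictionary suggestions for a misspelled term.
    print(closest_candidate("pnuemonia", ["pneumonia", "pneumonitis", "anemia"]))
```

A transposed pair of letters costs two edits under plain Levenshtein distance, which may be one reason to add keyboard and phonetic distances on top of it.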