Semantic Relation Extraction for Linking Named Entities to Biomedical Databases
Presenter: Lê Hoàng Quỳnh
Knowledge of Technology Laboratory, UET, VNU
lhquynh@gmail.com
Hanoi, February 18th, 2012

Main contents
• Motivation and purpose
• Some approaches: the pros and cons
• Discussion and Proposal
• Conclusion

Motivation and purpose
“… developing a state of the art named entities tagger for full open source biomedical texts …”
• Deploying various named entity recognizers to see which works best
• Linking each named entity to its appropriate identifier in public databases

Motivation and purpose (cont’)
Which named entities do we focus on?
• Phenotype descriptions
• Disease names
• Gene names
• Chemical names

Motivation and purpose (cont’)
Ontology =
• Concept/Class
• Term/Individual
• Relation/Property

Motivation and purpose (cont’)
The Biocaster Multilingual Ontology
biocaster.org

Motivation and purpose (cont’)
• How do we link the named entities to unique identifiers in a biomedical database?
• What is the difference between “linking” and “filling”?
• Method?
  o Clustering
  o Semantic relation extraction [LTB11]
  o …
[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.

Motivation and purpose (cont’)
Semantic relation extraction
• Semantic relation extraction is the task of extracting the underlying relations between two terms, as expressed by words or phrases [Gir08]
• Due to the unique patterns of biomedical relations, techniques designed for extracting relations from general text may not be suitable for the biomedical domain
[Gir08] Girju R. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.

Motivation and purpose (cont’)
Which kinds of semantic relation do we focus on?
• Hyponymy
• Synonymy
• Causal/effect
• Indicate/hasSymptom
• Treat
• …
Entity:
• Phenotype descriptions
• Disease names
• Gene names
• Chemical names

Some approaches
Three groups of existing methods:
• Pattern-based extraction relies on the occurrence of term pairs in the same contexts and uses the words in the context to identify the relation
• Distributional clustering uses the contexts that terms occur in individually and attempts to group semantically related elements based on similarities of these contexts
• Term variation is based on the form of the term and uses similarities between terms to identify which are semantically related

Some approaches (cont’)
Distributional clustering:
• Consider the contexts that a term tends to occur in, then apply clustering to work out which terms are most "similar"
• This methodology can find classes of words that are similar in meaning
  For example, using the verb "fire" we find the following classes of nouns:
  o Gun, Missile, Weapon
  o Shot, Bullet, Rocket, Missile
  o Officer, Aide, Chief, Manager

Some approaches (cont’)
Distributional clustering:
• Pros:
  o Does not require that the terms occur in the same sentence or even in the same document
  o Generally has a higher recall than pattern-based methods
• Cons:
  o Requires a mathematical approach to determine the clusters of terms which have a similar distribution of contexts
  o It is very difficult to work out the nature of the relationship between the terms from distributional clustering
  o Not suitable for extracting specific relationships such as "X is a causal agent of Y"

Some approaches (cont’)
Term variation:
• Look at the form of the actual term and use the similarity of the words in it to deduce whether the terms are related
  For example: "cancer of the mouth" and "mouth cancer"
• Jacquemin [Jac99] defines three main ways in which term variation occurs:
  o Syntactic variations
  o Morpho-syntactic variations
  o Semantic variations
[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348. 1999.

Some approaches (cont’)
Term variation:
• Pros:
  o Often has very high precision
  o Strongest for finding whether two terms are synonymous
  o Can prove useful for some other cases as well
• Cons:
  o Cannot help to identify relationships between terms with no similarity in form

Some approaches (cont’)
Pattern-based extraction involves finding the terms in the same sentence and in some "pattern" that is suggestive of a particular relation.
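As a rough illustration of the sentence above, the following minimal Python sketch checks whether two already-recognised terms co-occur in one sentence inside a relation-specific pattern. The pattern table, the relation names and the function `extract_relations` are hypothetical and chosen only for illustration; the terms are assumed to come from an earlier named entity recognition step.

```python
import re

# Hypothetical pattern table (an assumption, not taken from the slides):
# relation name -> template with two term slots, x and y.
PATTERNS = {
    "hyponym": "{y}, such as {x}",  # x is a kind of y
    "cause": "{x} causes {y}",      # x is a causal agent of y
}


def extract_relations(sentence, terms):
    """Return (relation, x, y) triples for term pairs linked by a pattern.

    `terms` is assumed to be the output of a named entity recognizer
    that has already been run on the sentence.
    """
    found = []
    lowered = sentence.lower()
    for x in terms:
        for y in terms:
            if x == y:
                continue
            for relation, template in PATTERNS.items():
                # Fill both slots, then look for the phrase as whole words.
                phrase = template.format(x=x.lower(), y=y.lower())
                if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                    found.append((relation, x, y))
    return found


sentence = "Some kinds of flu, such as bird flu are transmitted by poultry."
print(extract_relations(sentence, ["flu", "bird flu"]))
# prints [('hyponym', 'bird flu', 'flu')]
```

Because every ordered pair of terms is tested against every template, the cost grows with the square of the number of terms per sentence; this is acceptable for single sentences but is only a sketch, not a tuned extractor.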
• Hearst [Hea92] used patterns to extract terms that exhibit the hyponymy relation
• Her approach noted that such terms often occur near each other in stereotypical patterns
  "Some kinds of flu, such as bird flu are …"
  Pattern: noun phrase - "such as" - noun phrase
  hyponym("bird flu", "flu")
• Method for developing these patterns:
  o Decide on a lexical relationship
  o Collect a set of term pairs known to have this relationship and a corpus which contains these pairs
  o Find the places where these terms co-occur
  o Find commonalities and hypothesize a pattern
  o Use this pattern to find more term pairs and repeat the process
[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.

Some approaches (cont’)
Pattern-based extraction
• Pros:
  o Simple
  o Patterns can be specialised for different relationships
  o Can be used for various languages
• Cons:
  o The method is manual
  o There is no way to rigorously compare the effectiveness of different patterns, which may lead to the inclusion of relatively "weak" patterns
  o It is not clear how to automatically generate patterns which are specific to a given relationship and domain
  o As patterns rely on finding the two terms in the same context, recall is limited, and ambiguity in the text can cause errors in the extractions
  o Identifying the boundaries of the terms is a problem

Some approaches (cont’)
McCrae’s approach [Mcc09] for synonymy and hyponymy relations
• Starts with the most general pattern, that is, the pattern consisting only of wild cards
• Develops a more specific pattern by replacing wild cards with terms from some corpus (full text chap. 3.1)
[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctor of Philosophy thesis.
Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies. September 2009.

Some approaches (cont’)
McCrae’s approach:
• Problem of identifying term boundaries
  entity = (NN|JJ|NNS|NNP|FW|NNPS|JJR)* (NN|NNS|NNP|NNPS)
  NN: A singular noun
  NNS: A plural noun
  NNP: A proper noun
  NNPS: A pluralised proper noun
  JJ: An adjective
  FW: A foreign word
  JJR: An adjective in comparative form

Some approaches (cont’)
McCrae’s approach:
• Covering every possible variation of the patterns makes the search space far too large to be tractable
• It is necessary to find a way to cover this search space more efficiently:
  o prioritizing "better" patterns
  o skipping those patterns which are too similar to existing patterns

Some approaches (cont’)
McCrae’s approach:
• Rule definition:
  Pattern: *1 * such as *2
  Rule: :- name() words(1,1) "such" "as" name()
• Simplifying the rules
  o Match-set (Chap. 3.2.1 in full text)
    :- words(1,2) name() words(0,1) words(2,3) "literal" name()
    Simplified form:
    :- words(1,1) name() words(2,4) "literal" name()
  o Join-set and alignment (Chap. 3.2.2 in full text)
    :- "a" name() "b" "c" "d" name()
    :- words(,1) name() words(2,3) "c" name()
    Alignment on these rules: {(2, 2), (4, 4), (6, 5)}
    The alignment-to-join conversion:
    :- words(,1) name() words(2,3) "c" words(0,1) name() words(0,0)
    Simplified form:
    :- name() words(2,3) "c" words(0,1) name()
• Classification

Some approaches (cont’)
McCrae’s approach: Results
[Figure: results]

Some approaches (cont’)
Approach utilizing the Web [SNR08] [TNN10]
• RDF describes a Semantic Web using RDF statements, which are triples of the form <Subject, Property, Object>
• Query the search engines with lexico-syntactic patterns to retrieve relevant information
• The “seed” patterns are initially handcrafted but can be progressively learnt
• Extract relations from snippets
[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram.
Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3. March 2008.
[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In IALP 2010, pages 170-173, Harbin, Heilongjiang, China, December 28-30, 2010.

Some approaches (cont’)
• [SNR08] focuses on discovering causal relationships between a disease and a biological entity
• Application: augmenting ontologies
• Purpose: given a disease, discover the likely causes of this disease

Some approaches (cont’)
Approaches summary and evaluation
Method                      Precision   Recall    Applicability
Patterns                    OK          Limited   Produces specific results for any relationship
Distributional clustering   OK          Good      Only produces a concept of "semantic relatedness"
Term variation              Good        Poor      Strongest for synonymy, some use elsewhere

Some approaches (cont’)
What about using machine learning?
• Using CRF [BDS08]:
  o Extracts both the existence of a relation and its type
  o Uses two types of CRF
• Using kernel-based learning [LZL08]:
  o Relation detection: a binary classification of true and false relations
  o Relation classification: a 4-class classification of the four relation types
[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207. doi:10.1186/1471-2105-9-207
[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008.

Discussion and Proposal
Challenges
• Language complexity
• Requires good pre-processing (POS-tagging, chunking, NER, etc.)
• …
• Techniques designed for extracting relations from general text may not be suitable for the biomedical domain
• Lack of tools and data
• …
• It is unlikely that the extracted relations will match the structure of the ontology

Discussion and Proposal
Challenges
• Modifiers: the inclusion of an adjective modifier in a term
  For example: "acute headache" & "headache"; "mental retardation"
• Granularity: terms are nearly always used synonymously but have slight differences in their meaning
  For example: "HIV-1" is the most common strain of "HIV", but "HIV-2" is less easily transmitted and mostly confined to a small area of West Africa
• Property: two terms refer to the same thing but with a slightly different property
  For example: "dengue shock syndrome" is a late-stage development of "dengue fever"

Discussion and Proposal (cont’)
Compromises
- Identify the type of relationship or not
- Binary classification or multi-label classification
- 1 or 2 classifiers
- Pattern-based extraction, distributional clustering or term variation
- Using machine learning or not
- …

Discussion and Proposal (cont’)
Proposal
• Only deal with intra-sentence relations!
• 2 classifiers
• Pattern-based extraction and term variation
• Semi-supervised learning
• There is still no strong definition or training resource for phenotype and disease; we need to work on this using available resources such as the Human Phenotype Ontology and the CALBC data set from the EBI shared task 2011

Discussion and Proposal (cont’)
What about the model?

Conclusion & Future Works
• Purpose: hyponymy, synonymy and causal relation extraction for phenotype descriptions, disease names, gene names and chemical names
• Improve the method (using semantic patterns & term variation, bootstrapping techniques, etc.)
• Explore data and ontology
• Review “linking to ontology”
• Propose a model
• Try to use other available resources

References
[LTB11] Hoang-Quynh Le, Mai-Vu Tran, Nhat-Nam Bui, Nguyen-Cuong Phan, Quang-Thuy Ha. An Integrated Approach Using Conditional Random Fields for Named Entity Recognition and Person Property Extraction in Vietnamese Text. In IALP 2011, Penang, Malaysia.
[TNN10] Mai-Vu Tran, Tien-Tung Nguyen, Thanh-Son Nguyen, Hoang-Quynh Le. Automatic Named Entity Set Expansion Using Semantic Rules and Wrappers for Unary Relations. In IALP 2010, pages 170-173, Harbin, Heilongjiang, China, December 28-30, 2010.
[Mcc09] John Philip McCrae. Automatic Extraction of Logically Consistent Ontologies from Text Corpora. Doctor of Philosophy thesis. Department of Informatics, School of Multidisciplinary Sciences, The Graduate University for Advanced Studies. September 2009.
[BDS08] Markus Bundschus, Mathaeus Dejori, Martin Stetter, Volker Tresp, Hans-Peter Kriegel. Extraction of semantic biomedical relations from text using conditional random fields. BMC Bioinformatics 2008, 9:207. doi:10.1186/1471-2105-9-207
[Gir08] Girju R. Semantic relation extraction and its applications. ESSLLI 2008 Course Material, Hamburg, Germany, 4-15 August 2008.
[LZL08] Jiexun Li, Zhu Zhang, Xin Li, Hsinchun Chen. Kernel-Based Learning for Biomedical Relation Extraction. Journal of the American Society for Information Science and Technology, 59(5):756–769, 2008.
[SNR08] Saurav Sahay, Shamkant B. Navathe, Ashwin Ram. Discovering Semantic Biomedical Relations Utilizing the Web. ACM Transactions on Knowledge Discovery from Data, Vol. 2, No. 1, Article 3. March 2008.
[Jac99] Christian Jacquemin. Syntagmatic and paradigmatic representations of term variation. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 341-348. 1999.
[Hea92] Marti A. Hearst. Automatic acquisition of hyponyms from large text corpora.
In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.
[Bio] http://biocaster.org

Thank you for your attention!