XML, Standards, and Ontologies
CSE 5095
Prof. Steven A. Demurjian, Sr.
Computer Science & Engineering Department
The University of Connecticut
371 Fairfield Road, Box U-255
Storrs, CT 06269-2155
steve@engr.uconn.edu
http://www.engr.uconn.edu/~steve
(860) 486 - 4818

Overview
  What is XML? How is it Used Today?
  XML Databases
  HL7 and CDA
  Other Standards
    MeSH
    Unified Medical Language System
    ICD9 and ICD9-CM (Intl. Classification of Diseases)
    ICD10 and ICD10-CM
    SNOMED-CT (Clinical Terms)
    National Drug Codes (NDC)
  Ontologies – Biomedical and Clinical
    What are they? How are they Used? Can they be Improved?

What is one Possible Solution?
  Standards and Usage of XML
  XML Used in a Myriad of Contexts
    Modeling and Information Exchange (XML Schemas and Instances)
  XML Standards
    XACML – eXtensible Access Control Markup Language
    OWL – Web Ontology Language
    HL7/CDA
  XML Databases
  What is/will be its Eventual Role in BMI?

Overview of XML
  XML Overview, Tags, Schema.
  XML Query Languages: XPath & XQuery
  XML Data Models
  Storage Strategy + XML DBMS: Relational, CMS, Native
  Native XML DBMS: Pros/Cons.
  Biomedical Information and Databases
  BMI Standards and Examples: HL7 and CDA
  Survey of Technology

XML overview
  eXtensible Markup Language
  Similar to HTML
  Meta-language that describes the content of the document (self-describing)
  XML is primarily used as a data storage and interchange medium
  XML exists in plain text format; however, it may be compressed or altered for transfer

XML overview cont.
  There are no predefined tags or grammar inherent in XML
  XML tags give an XML document structure and meaning
  Available tags are defined by a schema.
  All tags in an XML document come in pairs, open and close
  Tags are completely nested, and there is no ambiguity in their order

XML tags
  XML tags may have attributes, which store information or metadata within the tag
  Plain text can be placed between tags; as element content it is parsed character data (PCDATA), and only CDATA sections are left unparsed
  CDATA as an attribute type is character data: any string of non-markup characters is legal as the attribute value
  The ENTITY attribute type indicates that the attribute will represent an external entity in the document itself
  The ID attribute type is used if you want to specify a unique identifier for each element.

XML Schema
  The structure of an XML document is defined by its schema.
  Dozens of languages define XML schemas:
    DTD
    W3C XML Schema (XSD)
    RELAX NG
  An instance of an XML document can be validated against this file.
  This file, or schema, also defines allowable tags.

Sample XML Structure
  XML employs a tree structure model for representing data (previous slide)
  shiporder
    shipto
      name
      address
      city
      country
    orderperson
    orderid
    item
      title
      name
      quantity
      price

Schema Example (XSD)
  <?xml version="1.0" encoding="ISO-8859-1" ?>
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="shiporder">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="orderperson" type="xs:string"/>
          <xs:element name="shipto">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="name" type="xs:string"/>
                <xs:element name="address" type="xs:string"/>
                <xs:element name="city" type="xs:string"/>
                <xs:element name="country" type="xs:string"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
          <xs:element name="item" maxOccurs="unbounded">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="title" type="xs:string"/>
                <xs:element name="note" type="xs:string" minOccurs="0"/>
                <xs:element name="quantity" type="xs:positiveInteger"/>
                <xs:element name="price" type="xs:decimal"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
        <xs:attribute name="orderid" type="xs:string" use="required"/>
      </xs:complexType>
    </xs:element>
  </xs:schema>

Querying XML - XPath
  Many languages to query XML
  XPath and XQuery are W3C standards
  XPath is a compact method of traversing the previous tree
  Designed to facilitate use via URLs/URIs
  /shiporder/item/name ← view all items' names
  Extensible to add user defined behaviors
  Treats each tag as a node in the tree

Querying XML - XQuery
  Functional extension of XPath
  XML equivalent of SQL
  Navigate and manipulate document nodes.
  Works on collections of documents, or even fragments.
  for $b in doc("bib.xml")//book
  where $b/publisher = "Morgan Kaufmann" and $b/year = "1998"
  return $b/title

XML Models
  Naively there are two models of XML use:
    Data-centric
    Document-centric
  In reality, most XML use is a hybrid of the two
  More important is the database strategy used with XML
    Relational
    Content Management
    Native XML

Data – Centric Model
  Information is generally stored in a relational database
  XML is a transport medium, nothing more
  Irrelevant to the application that data exists as XML for some period of time
  Characteristics:
    Fine grained data.
    Data relationship is insignificant.
    Need to transfer relational information.
    Means of storing new information.
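The XPath query from the slides above can be tried out with a minimal sketch like the following, which uses Python's standard `xml.etree.ElementTree` module (it supports only a subset of XPath). The instance document and its values are hypothetical, written only to conform to the shiporder schema shown earlier. Because that schema names the child of `item` as `title` rather than `name`, the sketch queries `title`. The second function mimics the filter-and-return shape of the XQuery example in plain Python; it is not XQuery itself.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical instance document conforming to the shiporder schema.
SHIPORDER_XML = """\
<shiporder orderid="889923">
  <orderperson>John Smith</orderperson>
  <shipto>
    <name>Ola Nordmann</name>
    <address>Langgt 23</address>
    <city>Stavanger</city>
    <country>Norway</country>
  </shipto>
  <item>
    <title>Empire Burlesque</title>
    <note>Special Edition</note>
    <quantity>1</quantity>
    <price>10.90</price>
  </item>
  <item>
    <title>Hide your heart</title>
    <quantity>1</quantity>
    <price>9.90</price>
  </item>
</shiporder>
"""


def item_titles(xml_text):
    """Return the text of every /shiporder/item/title node."""
    root = ET.fromstring(xml_text)
    # root is the <shiporder> element, so the path is relative to it.
    return [node.text for node in root.findall("./item/title")]


def titles_cheaper_than(xml_text, limit):
    """Return titles of items whose price is below `limit` (a Decimal)."""
    root = ET.fromstring(xml_text)
    return [
        item.findtext("title")
        for item in root.findall("item")
        if Decimal(item.findtext("price")) < limit
    ]


if __name__ == "__main__":
    print(item_titles(SHIPORDER_XML))
    print(titles_cheaper_than(SHIPORDER_XML, Decimal("10")))
```

Each tag is treated as a node in the tree, so the path expression walks from the root element down to the matching leaves. A full XPath 2.0 or XQuery engine would be needed for anything beyond these simple child-step paths.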
Document – Centric Model
  When XML is utilized solely as a document
    This presentation in Open Office
  The documents in part, or in full, are stored and retrieved
  Does not originate from relational database
  Document used for human consumption
  Usually information written by hand in a format like PDF or RTF, then converted to XML

Reality: Hybrid Model
  Most documents like a PDF will also contain small grained information (last edited date, character set)
  Data from a relational DB may even be a document, or require self description
  Various database technologies support all models
  Important to understand your data, and choose the DB technology that is most compatible

XML as Data Exchange Medium
  Widespread Usage Across Computing
  UML Tools have Standardized on XML Schema
    Export Given UML Design to XML Instances
    Track Both Design Data and Graphical Data
  Database Interactions via XML
    Import from XML into a Relational Schema
    Export from a Relational DB into XML Schema and Instances
  Web Services Exchange of Information
    SOAP, WSDL, and UDDI
  Facilitates Information Exchange and Portability

Medical Data Model
  Medical data is non-homogeneous
  But there exist general trends in medical data:
    Fine grain data such as dates, times, images
    Documents and human generated descriptions and observations
  Human interaction creates semi-structured data
  Ability to transfer information is essential
  Medical data fits into hybrid model

Data – Centric Comparison
  Advantages:
    Utilizes existing database software.
    (IBM, Oracle, SQL Server)
    Quick (existing DBs are already fast)
    Dual role (not limited only to XML)
    Many even support XQuery
  Disadvantages:
    More configuration (mapping relational -> XML)
    Slower when creating complex XML files due to middle step

Document – Centric Comparison
  Advantages:
    Good integration into workflow
    Document management made easy
    Collaboration and web publishing
  Disadvantages:
    Not able to extract data from document directly
    Not designed for high availability, high load systems
    Non-uniformity in implementations

Storage Strategy: Relational
  Utilizing a relational database to store XML documents and data is very popular
  In a very data – centric application this approach is intuitive
  Most top tier database applications support XML in some way
    Oracle, SQL Server, IBM, etc...
  Software is highly supported and well developed.

XML Schema mapping
  Using a relational DB requires mapping XML schema to DB schema.
  Table based:
    Often implemented as a middleware layer
    Schema structure must follow row-column convention
  Object – relational:
    XML is a tree of objects
    Mapped to DB using well established OR methods
    Natively supported in some DB apps

Storage Strategy: CMS
  CMS – Content Management System
  Used exclusively in the document-centric model
  Various programs allow indexing, storage, manipulation, and publication of XML documents
  Application specific
  Numerous implementations, most recently Open Office and MS Word 2007
  Not very interesting or useful in context of biomedical information

Storage Strategy: Native
  Semi – structured data
  Mapping to relational DB causes inflation and null space
  Need more functionality and granularity than CMS
  Performance increase over relational DB by avoiding joins
    Assuming data is in appropriate order on disk
  Only returns XML, need to convert for non XML manipulation
  Development still in infancy as of Winter 2007

Native XML
Databases
  Definition: "A database that has an XML document as its fundamental unit of (logical) storage and defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order."
  Data types: No support in XML, need a mapping
    Document or database schema can be used
    External user defined mapping
    Not necessary when only transferring data
  No requirement on underlying medium or implementation
  Two architectures: text and model based

Native: Text-based
  Use any DB
  Rather than mapping schemas, store entire XML documents
  Usually involves saving entire document as a BLOB / Character LOB
  Utilize various text field searches to retrieve info from XML document
  Some DB text searches are being made XML aware
  Speed: with the document located on disk, full or partial document retrieval is favored

Native: Model-based
  Internal object model of the document schema
  Store this model in a database
    Relational / object-oriented database
    Proprietary
  Performance similar to chosen DB engine
  Still limited by hierarchy of XML data
    Retrieving all orderids from hundreds of docs is slow
  Support for common XML query languages
    XPath, XQuery, etc...

Native XML: TLC
  In the traditional database world, transactions, locking and concurrency are paramount
  Native XML databases aren't mature enough to support everything
  Most support transactions, but what about LC?
  Document level locking is easy, but too coarse.
  Only a few implementations support node level locking
  Commercial products generally support ACID, free ones are just starting to (2008)
    Atomicity-Consistency-Isolation-Durability

Native XML: APIs
  Ubiquity of ODBC interfaces
    Still applies to native XML databases
  Most implementations provide their own interface for a variety of languages
  Industry standardization:
    XML:DB API from XML:DB.org, programming language neutral
    JSR 225: XQuery API for Java (XQJ). IBM and Oracle

Native XML: The Rest
  Referential integrity is supported in an ad hoc manner at best
  Database cannot enforce user defined (via schema) integrity
  Some standard mechanisms allow it
  Eventually both mechanisms will be supported
  Currently relies heavily on application for normalization and integrity
  Certainly a drawback for medical applications

Native XML: Scalability
  Limitation of any DB is time spent seeking HD
  XML only needs to find pointer to head of doc
  Therefore an XML DB should scale well in the context of retrieving data
  The only caveat is if the retrieval breaks the document hierarchy
    More pointers must be followed, potentially slowing retrieval greatly
  Where there is money, there is a way

Biomedical Information
  Overview of the field.
  Data storage and transfer problem.
  XML as a solution.
  BMI XML examples.
  Next section: Choosing a native DB.

BMI Overview
  The convergence of computation and biomedicine
  The NIH BMI Science and Tech Initiative:
    Define biomedical computing as a science
    Many sources of information: Clinical, surgical, genetics, drug design, biology
    Standardization in software
    Algorithm development, high speed computing
  All relies on efficient storage and transfer of information

BMISTI: Databases
  "Biomedical computing is entering an age where creative exploration of huge amounts of data will lay the foundation of hypotheses." ~NIH Director
  Problems:
    Standards.
    Terminology, syntax and semantics need to be defined and agreed upon to allow integration of data
    Curation. Database submissions need to be checked and cross-referenced to avoid the transitive propagation of error
    Interoperability. Data should be as consistent as possible across databases so that researchers can compare and contrast it
  Computational and Systems issue:
    Utilize and manipulate information.
    Process large volumes of information.

BMI: XML
  Data sharing and semantic interoperability
  Case study: Electronic Health Record
    The development and use of an integrated health record for a patient
    Heterogeneous data, e.g. clinical, clinical-trial, genomic data
  Primary Obstacle: Proprietary data formats
  Uniformity on technical level: Text file
  Step towards semantic goal

XML in Clinical Data
  HL7 standards organization.
  V2: ASCII bar format.
  example:
    HL7V3|1|2.02
    Message|2.16.840.1.113883.1122^CNTRL-3456|2002081614303516^- ---> 06:00||3.0|2.16.840.1.113883^POLB_IN004410||P|I|ER|ER
    respondTo|RSP|tel:555-555-5555^^WP
    entityRsp|||{FAM^^Hippocrates~GIV^^Harold~GIV^^H~SFX^AC^MD}|tel:555-555-5555^^WP
    sender|SND|nfs:127.127.127.255
    device||2.16.840.1.113883.1122^GHH LAB|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H
    agencyFor representedOrganization||\NOTH\
    location|||2.16.840.1.113883.1122^ELAB-3|{^^GHH Lab}^TN
    receiver|RCV|nfs:127.127.127.0
    device|||2.16.840.1.113883.1122^GHH O E|{GIV^^An Entity Name}^L|||tel:555-555-2005^^H
    agencyFor representedOrganization|||2.16.840.1.113883.19.3.1001|{^^GHH Outpatient Clinic}^TN
    location|||2.16.840.1.113883.1122^BLDG4|{^^GHH Outpatient Clinic}^TN
  Awkward, inflexible, unclear meaning of values.

HL7 V3 Specification
  Built around Reference Information Model:
    Entity, Role, Participation, and Act
  Utilizes dedicated vocabularies and data types.
  Every specification must begin from RIM.
  Clinical Document Architecture
    XML with tags like "observation, code, value and id".
  Utilizes
  <observation classCode="OBS" moodCode="EVN">
    <id root="10.23.4573.15879"/>
    <code code="313193002" codeSystem="2.16.840.1.113883.6.96" codeSystemName="SNOMED CT" displayName="Peak flow"/>
    <effectiveTime value="20000407"/>
    <value xsi:type="RTO_PQ_PQ">
      <numerator value="260" unit="l"/>
      <denominator value="1" unit="min"/>
    </value>
  </observation>

XML in Clinical Trials
  Example: Drug studies
  Utilizing XML would eliminate manual transcription when moving data from one system to another
  XML is a universal datatype as it stores everything in text
    Therefore can handle new tech. seamlessly
  Clinical Data Interchange Standards Consortium
    Industry standardization

CDISC: ODM
  Operational Data Model: XML based
  Facilitate moving data from any collection system to clinical trial sponsor
  Addresses real world issues:
    Incomplete data
    Partial data transfer
    Versioning and branching
  ODM 1.1 current version

ODM: Layout

XML in Genomic Data
  Various groups export their data in XML
    NCBI, EBI
  They do not follow the same schema, which only allows partial semantic interoperability
  Microarray Gene Expression Group (MAGE) publishes a schema
  MAGE files are often several gigabytes
    Illustrates overhead of XML; however, researchers still use it because of interoperability

XML Complexity
  Clinical Genomics Special Interest Group (HL7)
  Use genomic data in clinical environment
  Utilize several models such as MAGE, BSML (for DNA seqs)
  All information in raw models not necessary
  "Bubbling up" analyzes large raw data sets, extracts useful information
  Transfer useful information to new schema / model
  Bottom line: there exist complex workflows to extract usable information.

XML BMI Issues
  Clinical information like a verbal description or advice is unstructured
    How do you query this?
  Schemas and Models are extremely complex, with nesting, recursion and compound data types
    Difficult mapping to relational databases
  XML instances may be gigabytes in size
    What database solutions exist to handle such large files?

XML BMI Examples
  A closer look at the Clinical Document Architecture
  Mayo Clinic's implementation of CDA
  Case study using native XML database to facilitate research based upon clinical texts
    Tamino XML DB
  Querying native DB
  UCONN BMI, CSE 300 Spring 2008

XML BMI: CDA
  A clinical document has:
    Persistence: exists for a defined time period
    Stewardship: Maintained by a designated caretaker
    Potential for authentication: May be legally authenticated
  It must be human readable on a standard web browser
  Utilizes standard XML syntax
  www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf

XML BMI: CDA
  Mayo Clinic's use of CDA:
  www.hl7.de/iamcda2004/finalmat/day1/Calvin%20Beebe%20CDA%20Update.pdf

Survey of Native XML DBMS
  Comprehensive List: http://www.rpbourret.com/xml/XMLDatabaseProds.htm#native
  Commercial:
    Tamino XML Server
      Well developed, supported, many tools available
  Open Source:
    Sedna: Fully supports ACID, XQuery
    eXist: Great management, documentation, indexing

eXist
  http://www.rpbourret.com/xml/ProdsNative.htm#exist
  Proprietary data store (B+ trees).
  Supports XQuery/XPath 2.0
  Full text searches.
  XML:DB API.
  Document level concurrency.
  Complete documentation.
  Incomplete transaction support.

Sedna
  http://www.rpbourret.com/xml/ProdsNative.htm#sedna
  Underlying data storage based on DataGuide
  Supports XQuery/XPath 2.0
  Full text searches.
  Custom API for various languages.
  Command line admin.
  Transaction support.

XML References
  “Canonical XML Version 1.0”, John Boyer. 15 March 2001. W3C
  “XML Path Language (XPath) 2.0”. W3C Working Draft. 2 May 2003. W3C
  “XML Schema”. XML Schema Working Group.
  1 January 2008. W3C <http://www.w3.org/XML/Schema>
  “XML Schema: Formal Description”. Brown, Fuchs, et al. 25 September 2001. W3C <http://www.w3.org/TR/xmlschema-formal/>
  “Extensible Markup Language (XML)”. 1 January 2008. W3C <http://www.w3.org/XML/>
  http://www.25hoursaday.com/StoringAndQueryingXML.html
  http://www.nih.gov/about/director/060399.htm
  http://www.research.ibm.com/journal/sj/452/shabo.html
  “Overview of the CDISC Operational Data Model”. 26 April 2002. CDISC

What is one Possible Solution?
  Standards and Usage of XML
  Consider CDA – Clinical Document Architecture
  Standard for Clinical (Provider) Medical Record
  Clinical Record Organized as:
    <patient_encounter> - location
    <legal_authenticator> - MD
    <originating_organization> and <provider>
    <patient> - name, birthdate, gender
    <body_confidentiality-"CONF1"> - note
      History
      Past Medical History
      Medications
      Allergies
      Social History
      Physical Exam
      Vitals (BP, Resp, Temp, HR)
      Etc...

What is one Possible Solution?
  Let's Explore this in Greater Detail
  Starting with the CDA Header
  <?xml version="1.0"?>
  <!DOCTYPE levelone PUBLIC "-//HL7//DTD CDA Level One 1.0//EN" "levelone_1.0.dtd">
  <levelone>
    <clinical_document_header>
      <id EX="a123" RT="2.16.840.1.113883.3.933"/>
      <set_id EX="B" RT="2.16.840.1.113883.3.933"/>
      <version_nbr V="2"/>
      <document_type_cd V="11488-4" S="2.16.840.1.113883.6.1" DN="Consultation note"/>
      <origination_dttm V="2000-04-07"/>
      <confidentiality_cd ID="CONF1" V="N" S="2.16.840.1.113883.5.1xxx"/>
      <confidentiality_cd ID="CONF2" V="R" S="2.16.840.1.113883.5.1xxx"/>
      <document_relationship>
        <document_relationship.type_cd V="RPLC"/>
        <related_document>
          <id EX="a234" RT="2.16.840.1.113883.3.933"/>
          <set_id EX="B" RT="2.16.840.1.113883.3.933"/>
          <version_nbr V="1"/>
        </related_document>
      </document_relationship>
      <fulfills_order>
        <fulfills_order.type_cd V="FLFS"/>
        <order><id EX="x23ABC" RT="2.16.840.1.113883.3.933"/></order>
        <order><id EX="x42CDE" RT="2.16.840.1.113883.3.933"/></order>
      </fulfills_order>

CDA Example - Continued

Other Relevant Standards of Note
  MeSH
  Unified Medical Language System
  ICD9 and ICD9-CM (Intl. Classification of Diseases)
  ICD10 and ICD10-CM
  SNOMED-CT (Clinical Terms)
  National Drug Codes (NDC)

MeSH
  The Medical Subject Headings (MeSH®) thesaurus is a controlled vocabulary produced by the National Library of Medicine and used for indexing, cataloging, and searching for biomedical and health-related information and documents.
  2011 MeSH includes the subject descriptors appearing in MEDLINE®/PubMed®, the NLM catalog database, and other NLM databases.
  Many synonyms, near-synonyms, and closely related concepts are included as entry terms to help users find the most relevant MeSH descriptor for the concept they are seeking.
  http://www.nlm.nih.gov/mesh/

Descriptor Data Elements

Qualifier Data Elements

Supplementary Concepts

MeSH in ASCII
  *NEWRECORD
  RECTYPE = D
  MH = Calcimycin
  AQ = AA AD AE AG AI AN BI BL CF CH CL CS CT DU EC HI IM IP ME PD PK PO RE SD ST TO TU UR
  ENTRY = A-23187|T109|T195|LAB|NRW|NLM (1991)|900308|abbcdef
  ENTRY = A23187|T109|T195|LAB|NRW|UNK (19XX)|741111|abbcdef
  ENTRY = Antibiotic A23187|T109|T195|NON|NRW|NLM (1991)|900308|abbcdef
  ENTRY = A 23187
  ENTRY = A23187, Antibiotic
  MN = D03.438.221.173
  PA = Anti-Bacterial Agents
  PA = Ionophores
  MH_TH = NLM (1975)
  ST = T109
  ST = T195
  N1 = 4-Benzoxazolecarboxylic acid, 5-(methylamino)-2-((3,9,11-trimethyl-8-(1-methyl-2-oxo-2-(1H-pyrrol-2-yl)ethyl)-1,7-dioxaspiro(5.5)undec-2-yl)methyl)-, (6S(6alpha(2S*,3S*),8beta(R*),9beta,11alpha))
  RN = 52665-69-7
  PI = Antibiotics (1973-1974)
  PI = Carboxylic Acids (1973-1974)

MeSH in ASCII
  MS = An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.
  OL = use CALCIMYCIN to search A 23187 1975-90
  PM = 91; was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)
  HN = 91(75); was A 23187 1975-90 (see under ANTIBIOTICS 1975-83)
  MED = *62
  MED = 847
  M90 = *299
  M90 = 2405
  M85 = *454
  M85 = 2878
  M80 = *316
  M80 = 1601
  M75 = *300
  M75 = 823
  M66 = *1
  M66 = 3
  ETC

MeSH in XML - desc2011.dtd
  <!-- MeSH DTD file for Descriptor records.
  desc2011.dtd -->
  <!-- Author: MeSH -->
  <!-- Effective: 09/01/2010 -->
  <!-- #PCDATA: parseable character data = text
       occurrence indicators (default: required, not repeatable):
       ?: zero or one occurrence, i.e., at most one (optional)
       *: zero or more occurrences (optional, repeatable)
       +: one or more occurrences (required, repeatable)
       |: choice, one or the other, but not both
  -->
  <!ENTITY % DescriptorReference "(DescriptorUI, DescriptorName)">
  <!ENTITY % normal.date "(Year, Month, Day)">
  <!ENTITY % ConceptReference "(ConceptUI,ConceptName,ConceptUMLSUI?)">
  <!ENTITY % QualifierReference "(QualifierUI, QualifierName)">
  <!ENTITY % TermReference "(TermUI, String)">

MeSH in XML - desc2011.dtd
  <!ELEMENT DescriptorRecordSet (DescriptorRecord*)>
  <!ATTLIST DescriptorRecordSet LanguageCode (cze|dut|eng|fin|fre|ger|ita|jpn|lav|por|scr|slv|spa) #REQUIRED>
  <!ELEMENT DescriptorRecord (%DescriptorReference;,
      DateCreated,
      DateRevised?,
      DateEstablished?,
      ActiveMeSHYearList,
      AllowableQualifiersList?,
      Annotation?,
      HistoryNote?,
      OnlineNote?,
      PublicMeSHNote?,
      PreviousIndexingList?,
      EntryCombinationList?,
      SeeRelatedList?,
      ConsiderAlso?,
      PharmacologicalActionList?,
      RunningHead?,
      TreeNumberList?,
      RecordOriginatorsList,
      ConceptList) >
  <!ATTLIST DescriptorRecord DescriptorClass (1 | 2 | 3 | 4) "1">

MeSH in XML - desc2011.dtd
  <!ELEMENT ActiveMeSHYearList (Year+)>
  <!ELEMENT AllowableQualifiersList (AllowableQualifier+) >
  <!ELEMENT AllowableQualifier (QualifierReferredTo,Abbreviation )>
  <!ELEMENT Annotation (#PCDATA)>
  <!ELEMENT ConsiderAlso (#PCDATA) >
  <!ELEMENT Day (#PCDATA)>
  <!ELEMENT DescriptorUI (#PCDATA) >
  <!ELEMENT DescriptorName (String) >
  <!ELEMENT DateCreated (%normal.date;) >
  <!ELEMENT DateRevised (%normal.date;) >
  <!ELEMENT DateEstablished (%normal.date;) >
  <!ELEMENT DescriptorReferredTo (%DescriptorReference;) >
  <!ELEMENT EntryCombinationList (EntryCombination+) >
  <!ELEMENT EntryCombination (ECIN, ECOUT)>
  <!ELEMENT ECIN
      (DescriptorReferredTo,QualifierReferredTo) >
  <!ELEMENT ECOUT (DescriptorReferredTo,QualifierReferredTo? ) >
  <!ELEMENT HistoryNote (#PCDATA)>
  <!ELEMENT Month (#PCDATA)>
  <!ELEMENT OnlineNote (#PCDATA)>
  ETC

MeSH in XML - Sample
  <?xml version="1.0"?>
  <!DOCTYPE DescriptorRecordSet SYSTEM "desc2011.dtd">
  <DescriptorRecordSet LanguageCode = "eng">
  <DescriptorRecord DescriptorClass = "1">
    <DescriptorUI>D000001</DescriptorUI>
    <DescriptorName>
      <String>Calcimycin</String>
    </DescriptorName>
    <DateCreated>
      <Year>1974</Year>
      <Month>11</Month>
      <Day>19</Day>
    </DateCreated>
    <DateRevised>
      <Year>2006</Year>
      <Month>07</Month>
      <Day>05</Day>
    </DateRevised>
    <DateEstablished>
      <Year>1984</Year>
      <Month>01</Month>
      <Day>01</Day>
    </DateEstablished>

MeSH in XML - Sample
    <ActiveMeSHYearList>
      <Year>2007</Year>
      <Year>2008</Year>
      <Year>2009</Year>
      <Year>2011</Year>
    </ActiveMeSHYearList>
    <AllowableQualifiersList>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierUI>Q000008</QualifierUI>
          <QualifierName>
            <String>administration &amp; dosage</String>
          </QualifierName>
        </QualifierReferredTo>
        <Abbreviation>AD</Abbreviation>
      </AllowableQualifier>
      <AllowableQualifier>
        <QualifierReferredTo>
          <QualifierUI>Q000009</QualifierUI>
          <QualifierName>
            <String>adverse effects</String>
          </QualifierName>
        </QualifierReferredTo>
        <Abbreviation>AE</Abbreviation>
      </AllowableQualifier>
  ETC

Unified Medical Language System
  UMLS was developed for the National Library of Medicine
  Disease is a semantic type with around 392 relations (109 semantic relations and 22 other relations).
  Pneumonia is categorized under one semantic type, Disease, but has hundreds of relations.

UMLS Concepts, Semantic Types/Relations

ICD9 Respiratory Diseases

ICD10 Respiratory Diseases

SNOMED-CT
  SNOMED-CT stands for Systematized Nomenclature of Medicine Clinical Terms.
  SNOMED-CT is the result of merging two ontologies: SNOMED-RT and Clinical Terms.
  http://www.ihtsdo.org/snomed-ct/

SNOMED-CT
  Composed of Concepts, Terms, and Relationships
  Precisely Represent Clinical Information Across Scope of Health Care
  Content Coverage Divided into Hierarchies

SNOMED Example

National Drug Codes
  Tracking of Drugs (Prescription and OTC)
  From Submittal Through Approval
  Keeps Track of Many Details on Medication
  Each Drug by Manufacturer has Unique NDC Identifier
  See: http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm
  Searchable Database: http://www.accessdata.fda.gov/scripts/cder/ndc/default.cfm

NDC Examples

Biomedical & Clinical Ontologies
  Evolution of WWW
  Ontology
    Definition and Description.
    Example.
  Present Biomedical Ontology
  Need for Integration
  Application of Biomedical Ontology
    Clinical Trials
    OASIS: Integration Technique
    Clinical Decision Support System
  Summary
  Presentation from Rishi Saripalle, Spring 2008

Current Information Systems on WWW
  First Generation:
    Raw data, largely hand-coded by the user, was published online
    For example, static web pages
  Second Generation:
    Dynamic content generation driven by MDA and databases
    Machines generate the respective HTML
  Third Generation:
    Semantic Web: Generating machine processable information where the content is machine understandable, enabling intelligent services such as information brokers, search agents, and information filters to process domain related information.

What are Ontologies?
  Definition (from Philosophy): Ontology is the study of being or existence and forms the basic subject matter of metaphysics. It seeks to describe the basic categories and relationships of being or existence to define entities and types of entities within its framework.
  Definition (from Computer Science): In computer science, ontology means "specification of a conceptualization". It means "A data model that represents a set of concepts within a domain and the relationships between those concepts".

Advantages of Ontology
  Semantic way of representing knowledge of the domain
  Intelligent systems can provide reasoning
  Systems can make inferences within the ontology
  To share the common structure of information
  To reuse a similar domain ontology

Development of Ontology
  Determine the domain and scope (range) of the knowledge
  Look for an already existing ontology in a similar domain
  List all the terminologies or concepts of the domain
  List all the classes and instances to be created in the ontology
  Create the properties which will relate these concepts in the ontology

Example of Ontology
  Diagram labels: Class: Wine; Individual: Australian Yellow Tail; Properties: Color: Yellow, Flavor: Delicate, Maker: Australia, German

What are RDF and OWL?
  Researchers proposed the Semantic Web Stack, illustrating a hierarchy of languages where each layer exploits and uses capabilities of the layers below
  OWL and RDF belong to the family of knowledge representation languages.
  RDF: Resource Description Framework http://www.w3.org/RDF/
  OWL: Web Ontology Language http://www.w3.org/TR/owl-features/
  RDF is reminiscent of Semantic Networks, which were popular in the 1970s

Introduction to RDF / OWL

RDF: Resource Description Framework
  RDF represents the knowledge in triples format: Subject – Predicate – Object
  For example,
    Students – registerTo – Classes
    (Subject) (Predicate) (Object)
  One triple in RDF is referred to as a statement
  RDF is a grammar based language with syntax similar to XML
  RDFS (RDF Schema) has syntax similar to RDF and provides schema grammar to RDF.
    For example, rdfs:Class, rdfs:subClassOf etc.

RDF: Resource Description Framework
  RDF syntax of the above example:
  <rdfs:Class rdf:about="http://www.example.com/examle#Students" rdfs:label="Students"> </rdfs:Class>
  <rdfs:Class rdf:about="http://www.example.com/examle#Classes" rdfs:label="Classes"> </rdfs:Class>
  All the concepts described in the RDF are identified using a URI (ex. http://www.example.com/examle#Students).
  RDF can be viewed as a standardized framework for providing metadata to domain concepts.

OWL: Web Ontology Language
  OWL is placed on the top of the semantic web stack, utilizing all the powerful features offered by the layers below (RDF, RDFS, XML)
  OWL design has been influenced by description logic & knowledge representation paradigms
    SHIQ, Semantic Networks, Frames, SHOE, DAML, OIL, DAML+OIL.
  OWL provides richer semantic capabilities than its predecessor RDF
  For example, in the previous example, the predicate registerTo is of type rdf:Property.

OWL: Web Ontology Language
  OWL differentiates between properties by defining
    owl:ObjectProperty – for connecting two concepts (registerTo) and
    owl:DatatypeProperty - for connecting a concept to a datatype (utilized from XML)
  These two properties inherit from the RDF property
  OWL also defines owl:AnnotationProperty for embedding metadata onto classes, rules and axioms
  The following slide illustrates the use of OWL, RDF and RDFS (taken from a cardiac ontology built in OWL using the Protégé tool)

OWL: Web Ontology Language
  <owl:Class rdf:ID="Veins">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="Heart"/>
    </rdfs:subClassOf>
  </owl:Class>
  <Veins rdf:ID="Pulmonary_Vein"/>
  Diagram: Heart, Vein, Pulmonary Vein
  Pulmonary_Vein is declared as an instance of Veins, which is a subclass of Heart.
  The next slide illustrates the OWL properties and expressive power of OWL to restrict the domain and range values accepted by these properties.
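The Veins/Heart fragment above can be read back as subject, predicate, object statements with a minimal sketch like the one below. It uses only Python's standard `xml.etree.ElementTree`, not a real RDF parser, and it handles only the two patterns that appear in the fragment. The wrapping `rdf:RDF` element and the base namespace `http://www.example.com/cardiac#` are assumptions added so the fragment parses as a standalone document. In RDF/XML a typed node element such as `<Veins rdf:ID="..."/>` states that the resource has `rdf:type` Veins, so the sketch reports it as an instance.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS = {
    "rdf": RDF,
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
BASE = "http://www.example.com/cardiac#"  # hypothetical namespace for the ontology

# The fragment from the slide, wrapped in an assumed rdf:RDF root element.
OWL_XML = f"""\
<rdf:RDF xmlns="{BASE}"
         xmlns:rdf="{NS['rdf']}"
         xmlns:rdfs="{NS['rdfs']}"
         xmlns:owl="{NS['owl']}">
  <owl:Class rdf:ID="Veins">
    <rdfs:subClassOf>
      <owl:Class rdf:ID="Heart"/>
    </rdfs:subClassOf>
  </owl:Class>
  <Veins rdf:ID="Pulmonary_Vein"/>
</rdf:RDF>
"""

RDF_ID = f"{{{RDF}}}ID"
OWL_CLASS_TAG = f"{{{NS['owl']}}}Class"


def extract_triples(xml_text):
    """Return (subject, predicate, object) tuples for the top-level statements."""
    root = ET.fromstring(xml_text)
    triples = []
    for node in root:
        subject = node.get(RDF_ID)
        if node.tag == OWL_CLASS_TAG:
            # A class definition: collect each named superclass.
            for parent in node.findall("rdfs:subClassOf/owl:Class", NS):
                triples.append((subject, "rdfs:subClassOf", parent.get(RDF_ID)))
        else:
            # A typed node element: the tag's local name is the class.
            class_name = node.tag.split("}", 1)[1]
            triples.append((subject, "rdf:type", class_name))
    return triples


if __name__ == "__main__":
    for triple in extract_triples(OWL_XML):
        print(triple)
```

For real ontologies a dedicated RDF library is the better choice, since RDF/XML allows many equivalent spellings that this sketch does not cover.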
OWL: Web Ontology Language
  <owl:ObjectProperty rdf:ID="Complications">
    <rdfs:domain rdf:resource="#Cardiology_Diseases"/>
    <rdfs:range>
      <owl:Class>
        <owl:unionOf rdf:parseType="Collection">
          <owl:Class rdf:about="#Cardiology_Complications"/>
          <owl:Class rdf:about="#Cardiology_Diseases"/>
          <owl:Class rdf:about="#Cardiology_Causes"/>
        </owl:unionOf>
      </owl:Class>
    </rdfs:range>
  </owl:ObjectProperty>
The object property “Complications” takes domain values from the class “Cardiology_Diseases” and range values from the union of the classes Cardiology_Complications, Cardiology_Diseases, and Cardiology_Causes
OWL combined with RDF/RDFS provides an environment for developing domain ontologies by organizing and describing the domain concepts

Disease Ontology
[Figure: instances of Mitral_Valve_Disorders; hierarchical organization of Cardiology Diseases]

Disease Ontology
[Figure: Property Defined; representation of “Mitral_Valve_Prolapse” knowledge using properties and instances]

Implemented Ontology in OWL Format
  …………..
  <Congenital_Heart_Disease rdf:ID="Atrial_septal_defect">
    <Complications>
      <Cardiac_Arrhythmias rdf:ID="Arrhythmia">
        <Has_Intervention rdf:datatype="http://www.w3.org/2001/XMLSchema#string">defibrillation</Has_Intervention>
        <Have_Symptoms>
          <Cardiology_Symptoms rdf:ID="Dyspnea"/>
        </Have_Symptoms>
        <Has_Diagnosis_Test>
          <Cardiology_Diagnosis_Test rdf:ID="Coronary_Angiography">
            <Has_Synonyms rdf:datatype="http://www.w3.org/2001/XMLSchema#string">coronary catheterization</Has_Synonyms>
  ………………..

Bio-Medical Ontologies
Review a Wide Range of Available Ontologies and Standards:
  OpenCyc
  WordNet
  GALEN
  UMLS
  SNOMED-CT
  FMA
  Gene Ontology

OpenCyc
OpenCyc is an upper-level ontology developed by Cycorp Inc.
OpenCyc contains 6,000 concepts and 60,000 hand-coded assertions that capture “common sense language”, so that AI algorithms can perform human-like reasoning

Example of OpenCyc

WordNet
WordNet is an electronic lexical database developed at Princeton University that serves as a resource for applications in natural language processing and information retrieval.
  cancer, malignant neoplastic disease: any malignant growth or tumor caused by abnormal and uncontrolled cell division; it may spread to other parts of the body through the lymphatic system or the blood stream
  Cancer, Crab: (astrology) a person who is born while the sun is in Cancer
  Cancer: a small zodiacal constellation in the northern hemisphere; between Leo and Gemini
  Cancer, Cancer the Crab, Crab: the fourth sign of the zodiac; the sun is in this sign from about June 21 to July 22
  Cancer, genus Cancer: type genus of the family Cancridae

Unified Medical Language System
UMLS was developed by the National Library of Medicine
Disease is a semantic type with around 392 relations (109 semantic relations and 22 other relations).
Pneumonia is categorized under one semantic type, Disease, but has hundreds of relations.

SNOMED-CT
SNOMED-CT stands for Systematized Nomenclature of Medicine Clinical Terms.
SNOMED-CT is the result of merging two ontologies: SNOMED-RT and Clinical Terms.

Ontology Integration
All the ontologies developed have a common aim: describing the domain knowledge
Integration of ontologies is becoming critical
Applications tend to use multiple ontologies
Concepts in the various ontologies overlap, or the same concept is described in multiple ways.
For example, the concept “Blood” is described differently:
  as “Fluid” in one ontology
  as “Substance” in another ontology
  as “semi-solid” in a third ontology
Need to Reconcile these Differences When Attempting to “Combine” Data that Originates from Different Ontologies

Ontology Integration
Semantic vs. Structural Integration?
Difficulties of integration arise with similar, same, and complementary ontology integration.
[Figure label: Ontology B]

OASIS Ontology Mapping and Integration Framework

Application of Ontologies
Randomized Clinical Trials (RCTs): one of the least biased sources of clinical research evidence, and therefore a critical resource for the practice of evidence-based medicine
The scientific community is trying to encode the findings in a computer-processable language
However, for evidence to be put into practice, one has to analyze the data.
The canonical practice for trial interpretation is called Systematic Reviewing.
Sources for Data Specification:
  Trial Reports
  Trial Databases

Life Cycle of Clinical Trials
[Figure: Ontology Specifications]

Designing the Ontology
RCT ontology specifications are obtained from:
  Trial Reports
  Trial Databases: ClinicalTrials.gov, PDQ, etc.
The ontology is created by dividing the task into SubTasks and Methods.
This recursive process is called Competency Decomposition.
RCT decomposition methods combine Generic Tasks and Competency Questions.

Defining the Schema
[Figure: schema diagram. Labels: TRIAL, Intervention-ARM, Administrative Concept, OutcomeConcept, Population, Excluded Population, Analyzed Population; 188 Frames, 601 Slots]

Matching Patient Records to Clinical Trials
Low participation in clinical trials is a major problem in clinical and translational research.
Matching patient records to clinical trials is presently a manual and tedious procedure.
Need a Semantic Bridge between clinical ontologies (SNOMED-CT, etc.) and raw patient data for retrieving matching patient records, clinical guidelines, and clinical decision support systems (CDSS).

Technical Challenges
Challenges to be faced in a real-time scenario:
  Knowledge Engineering
  Scalability
  Noisy or Incomplete Data
Knowledge Engineering
  The clinical ontology has the concept “Drug”, which describes the active composition of the various drugs
  However, the patient record contains a list of vendor-specific drug names
  The clinical ontology describes the cause of the disorder.
  The patient records only specify the presence or absence of the disorder and where the clinical test was conducted.

Architecture of Solution
[Figure labels: Clinical Trials, Patient Data, SNOMED-CT, Query, Ontology, ABox, Reasoner, TBox]

Implementation Approach
Mapping Patient Data Terminology to SNOMED-CT
  Using UMLS as intermediate target
  NLP mapping techniques
  Manual Mapping
Map the raw patient data to SNOMED-CT terminology.
  Example: Cerner Drug: Lactulose Syrup 20G/30ml; SNOMED-CT: administeredSubstance
Allow the user to specify which terms in the definition are to be matched.
Last Bullet Means Ontology Matching is NOT Fully Automated!
This is a Real Problem for Interoperating Data!

Contrast in Representation
Example:
  SNOMED-CT: Disease1 hasAgent Virus007; Infection due to Bacteria001; Infection due to MicroBacteria007
  Patient Record: Disease1 Positive.
Because the patient record carries so little information, the query reasoner cannot find records that hold only partial data.

How are Observations Reconciled?
Clinical Trials   Description
NCT00084266       Patients with MRSA
NCT00288808       Patients with warfarin
NCT00298870       Patients on steroids
NCT00304382       Patients with Pneumonia, source of Blood or Sputum

Ontology expressions:
  ∃ associatedObservation MRSA
  ∃ associatedObservation Pneumococcal Pneumonia ⊓ ∃ hasSpecimanSource (Blood ⊔ Sputum)

Clinical Decision Support System
Clinical Decision Support Systems (CDSS) are interactive computer programs designed to assist physicians and other health professionals with decision-making tasks
Components of CDSS:
  Knowledge Base
  Rule-Based Engine
  Case Base
  Business Models

Example of Usage of Rules
IF “RULE 1” & “RULE 2” & “RULE 3” ….. “Rule n” THEN “INTERVENTION 1 or Rule M”
IF p.getGender() = “male” & p.getAge() = 34 & p.getBP() < 140 & p.getInsulinLevel() < 20 THEN “Asthma Intervention Level 2”
Class Patient: HasGender “male” ⊓ hasAge “34” ⊓ hasBP MoreThan 140 ⊓ hasInsulinLevel MoreThan 20

Summary - Ontologies
Ontology Definition and Descriptions. Example.
Biomedical Ontology
  OpenCyc
  WordNet
  GALEN
  SNOMED-CT
Integration of Ontologies
Application of Biomedical Ontology
  Clinical Trials.
  OASIS: Integration Technique.
  Clinical Decision Support System.

Concluding Remarks: XML/Standards
Explored Usage of XML Including:
  Basic XML Concepts
  XML Tools and Standards
  XML Databases
  Use of XML in BMI
Reviewed HL7 and CDA
Examined Numerous Standards
Reviewed Ontology Concepts
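
As a closing illustration, the IF ... THEN rule pattern from the Clinical Decision Support System slides can be sketched in Python. This is a minimal sketch, not a description of any real CDSS engine: the Patient class, its field names, the RULES list, and the recommend function are all hypothetical, and the thresholds are copied from the IF form of the rule on the slide.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical record; field names mirror the getters in the slide's rule.
    gender: str
    age: int
    bp: int             # blood pressure reading; the slide gives no unit
    insulin_level: int  # the slide gives no unit


# Each rule pairs a list of conditions (all must hold) with an intervention.
# The single rule below follows the IF form shown on the slide.
RULES = [
    (
        [
            lambda p: p.gender == "male",
            lambda p: p.age == 34,
            lambda p: p.bp < 140,
            lambda p: p.insulin_level < 20,
        ],
        "Asthma Intervention Level 2",
    ),
]


def recommend(patient, rules=RULES):
    """Return the interventions of every rule whose conditions all hold."""
    return [
        intervention
        for conditions, intervention in rules
        if all(condition(patient) for condition in conditions)
    ]


print(recommend(Patient(gender="male", age=34, bp=120, insulin_level=10)))
print(recommend(Patient(gender="female", age=34, bp=120, insulin_level=10)))
```

Keeping the conditions as data rather than hard-coded branches is one way a rule-based engine can add or retire rules without changing the matching logic; a knowledge base would supply the RULES list in a fuller design.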