Some thoughts on PATO
Chris Mungall, BBOP
Hinxton, May 2006

Outline
  Motivation revisited
  The Ontology: PATO
  OBD & using PATO for annotation

Who should use PATO?
  Originally: model organism mutant phenotypes
  But also:
    ontology-based evolutionary systematics
    neuroscience; BIRN
    clinical uses
      OMIM
      clinical records
    to define terms in other ontologies
      e.g. diploid cell; invasive tumor, engineered gene, condensed chromosome

Unifying goal: integration
  Integrating data
    within and across these domains
    across levels of granularity
    across different perspectives
  Requires
    Rigorous formal definitions
      in both ontologies and annotation schemas

Some thoughts on the ontology itself

Outline
  Definitions
    how do we define PATO terms?
    what exactly is it we’re defining?
  is_a hierarchy
    what are the top-level distinctions?
    what are the finer grained distinctions?
    shapes and colors

It’s all about the definitions
  Everything is doomed to failure without rigorous definitions
    even more so with PATO than other ontologies
  OBO Foundry Principle
    Definitions should describe things in reality, not how terms are used
      def should not use the word ‘describing’

Should we come up with a policy for definitions in PATO?
  currently: 19 defs (2.5 are circular)
  proposed breakout session: examine all these
    consistency: the property of holding together and retaining shape
    amplitude: The size of the maximum displacement from the 'normal' position, when periodic motion is taking place
    placement: The spatial property of the way in which something is placed
    pointed value: A sharp or tapered end
    epinastic value: A downward bending of leaves or other plant parts
    oblong value: Having a somewhat elongated form with approximately parallel sides
    elliptic value: Elliptic shape
    hearted value: Heart shaped
    fasciated value: Abnormally flattened or coalesced
    opacity: The property of not permitting the passage of electromagnetic radiation
    opaque value: Not clear; not transmitting or reflecting light or radiant energy
    undulate value: Having a sinuate margin and rippled surface
    permeability: The property of something that can be pervaded by a liquid (as by osmosis or diffusion)
    porosity: The property of being porous; being able to absorb fluids
    porous value: able to absorb fluids
    viscosity: a property of fluids describing their internal resistance to flow
    viscous value: a relatively high resistance to flow.
    latency: The time that elapses between a stimulus and the response to it
    power: The rate at which work is done

Proposal: genus-differentia definitions
  An S is a G which D
  Each def should refine the is_a parent
    Single is_a parent
  Example (non-PATO):
    binucleate cell def= a cell which has two nuclei
  Example (proposed PATO def):
    convex shape def= a shape which has no indentations
    opacity def= an optical quality which exists by virtue of the bearer’s capacity to block the passage of electromagnetic radiation
      very similar to existing def

This policy will reap benefits
  Advantages:
    Helps avoid circularity
    Ensures precision
    Consistency in wording
      user-friendly
  Considerations:
    Sometimes leads to awkward phrasing
      -ity suffix: “an opacity which…”
    Solution: allow shortened gerund form
      having…, being…, …
      most of the existing defs conform already
      implicit prefix “A G which exists by virtue of the bearer…”

From the top down
  First, the fake term ‘pato’ must be removed
  How do we define ‘attribute’?
  Note: I prefer the term ‘quality’ or ‘property’
    attribute implies attribution
      length_in_centimetres is an attribute
    we can of course continue to say ‘attribute’ but I use ‘quality’ in these slides
    most of the new PATO defs are phrased as ‘a property of…’
      which I like, but it is inconsistent with calling the root ‘attribute’
  Well then, what is a quality/property?
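
The genus-differentia pattern proposed above ("An S is a G which D") can be sketched in a few lines of Python. This is only an illustration, not part of PATO or any OBO tool: the class name TermDef, the function looks_circular, and the choice of 'value' as the genus of 'elliptic value' are all assumptions made for the example. The circularity check is a crude word-overlap test, so it would not catch cases such as 'porosity' defined via 'porous'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermDef:
    """Hypothetical record: a term with one is_a parent (genus) and a differentia."""
    name: str          # e.g. "convex shape"
    genus: str         # the single is_a parent, e.g. "shape"
    differentia: str   # e.g. "has no indentations"

    def text(self) -> str:
        """Render the definition in the form 'a G which D'."""
        article = "an" if self.genus[0].lower() in "aeiou" else "a"
        return f"{article} {self.genus} which {self.differentia}"


def looks_circular(name: str, genus: str, definition: str) -> bool:
    """Crude check: does the definition reuse a word of the term name,
    ignoring words the name shares with its genus?"""
    strip = ".,;:()'\""
    name_words = {w.strip(strip).lower() for w in name.split()}
    genus_words = {w.strip(strip).lower() for w in genus.split()}
    def_words = {w.strip(strip).lower() for w in definition.split()}
    return bool((name_words - genus_words) & def_words)


convex = TermDef("convex shape", "shape", "has no indentations")
opacity = TermDef(
    "opacity",
    "optical quality",
    "exists by virtue of the bearer's capacity to block the passage "
    "of electromagnetic radiation",
)

print(convex.text())
print(opacity.text())
# An existing definition from the list above, flagged by the word-overlap test:
print(looks_circular("elliptic value", "value", "Elliptic shape"))
# The proposed genus-differentia definition is not flagged:
print(looks_circular(convex.name, convex.genus, convex.text()))
```

Keeping genus and differentia as separate fields means the wording stays consistent across terms, and the genus can be checked against the term's is_a parent.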

What a quality is NOT
  Qualities are not measurements
    Instances of qualities exist independently of their measurements
    Qualities can have zero or more measurements
  These are not the names of qualities:
    percentage
    process
    abnormal
    high

Some examples of qualities
  The particular redness of the left eye of a single individual fly
    An instance of a quality type
  The color ‘red’
    A quality type
  Note: the eye does not instantiate ‘red’

PATO represents quality types
  PATO definitions can be used to classify quality instances by the types they instantiate
  [Diagram: the particular case of redness (of a particular fly eye) instantiates the type “red”; an instance of an eye (in a particular fly) instantiates the type “eye”; the redness instance inheres in (is a quality of, has_bearer) the eye instance]

Qualities are dependent entities
  Qualities require bearers
    Bearers can be physical objects or processes
  Example:
    A shape requires a physical object to bear it
    If the physical object ceases to exist (e.g. it decomposes), then the shape ceases to exist
  Some qualities are relational
    they relate a bearer with other entities
    e.g. sensitivity (to)
  Compare with: functions

The PATO hierarchy
  Proposal for a new top level division
  Proposal for granular divisions

Proposal 1: top level division
  Spatial quality
    Definition: A quality which has a physical object as bearer
    Examples: color, shape, temperature, velocity, ploidy, furriness, composition, texture
  Spatiotemporal quality
    Definition: A quality which has a process as bearer
    Examples: rate, periodicity, regularity, duration

Proposal 2: subsequent divisions
  Based on granularity (i.e. size scale)
    a good account of granularity is vital for inferences from molecular (gene) level to organismal (disease) level
  How do we partition the levels?
  Some qualities are realised at certain levels of granularity
  Others can be realised across levels
    shape, porosity

Sum-of-parts vs emergent
  Scale | Bearer | Quality | Definition (proposed)
  Physical | Cont. | Mass | Equivalent to the sum of the mass of the parts of the bearer (mass at the particle level is primitive/outwith PATO)
  Physical | Cont. | Opacity | An optical quality manifest by the capacity of the bearer to block light
  Phys/Chem | Liquid | Concentration | A compositional relational quality manifest by the relative quantity of some chemical type contained by the bearer
  Molecular | Gene | splicing quality | manifest by the splicing processes undergone by the bearer
  Cellular | Cell | ploidy | A cellular quality manifest by the number of genomes that are part of the bearer
  Cellular | Cell | transformative potency?? | A cellular quality manifest by the capacity of the bearer cell to differentiate to different cell types

  Bearer: Cont.
  Quality: morphology
    shape
      2D shape
      3D shape
  Definition (proposed): A morphological quality which is manifest …

Granular hierarchy
  quality
    spatial quality
      spatial physical and physico-chemical quality
        mass, concentration
      spatial biological quality
        spatial molecular quality
        spatial cellular quality
        spatial organismal quality
      spatial quality, multiple scales
        morphology/form
        optical quality
          color, opacity, fluorescence

Advantages of dividing by granularity
  Modular
    strategic question: should we focus on biological qualities and work with others on morphology, physics-based qualities etc?
  Good for annotation
    easy to constrain at high level
    e.g. organismal qualities cannot be borne by molecules
  Mirrors GO and OBO Foundry divisions
  Easier to find terms
    to be proved, but I believe so

Considerations
  Possible objection:
    The upper level of an ontology is what the user sees first
    terms such as “cross-granular quality” may be perceived as undesirable and/or abstruse by some users
  Counter-argument
    Solvable using ontology views
      aka subsets, slims

Relative and absolute
  Currently PATO terms often come in 3s:
    e.g. mass, relative mass, absolute mass
  Why do we need these?

PATO: One or two hierarchies?
  Currently two hierarchies
    attribute
    value
  My position: there should be one hierarchy of qualities
  My compromise: it should be possible to transform PATO automatically into a single hierarchy

Current PATO
  [Diagram: two parallel is_a hierarchies. Under "attribute": color, with hue, sat., var. Under "value": colorV, with hueV, sat.V, var.V, and below them blueV, darkV, paleV, …, blackV. "range" links connect attribute terms to value terms.]

Proposed change
  [Diagram: a single is_a hierarchy under "attribute": color, with hue, sat., var., and the former value terms blue, dark, pale, …, black attached by is_a.]

Arguments for a single hierarchy
  Practical
    elimination of redundancy
    no clear line for deciding what should be A and what should be V
      shape, bumpy vs bumpiness
  Ontological
    what kind of thing is a ‘value’?
  Diederich 1997: [quote here]

Arguments against
  Two hierarchies reflect cognitive and linguistic structures
    e.g. the color of the rose changed from red to brown
      3 cognitive artifacts
    we want to present data in a way that is natural to users
    …but this can be solved with a single collapsed hierarchy
  Two are useful for cross-products
    see later: distinguish modifiers from values
  EAV is common database pattern
    so…?

Compromise: transformations
  The Two Hierarchies approach is workable if they can be automatically collapsed
  Prerequisite: univocity
    Each ‘value’ must be defined to mean exactly one thing only
    i.e. each ‘value’ must be the ‘range’ of a single attribute
  Example
    having a value ‘fast’ that could be applied to both the spatial quality ‘velocity’ and the process quality ‘duration’ would be forbidden

Collapse on ‘ranges’
  [Diagram: the attribute hierarchy (color; hue, sat., var.) and the value hierarchy (colorV; hueV, sat.V, var.V; blueV, darkV, paleV, …, blackV) joined along their "range" links.]

Shapes and colors
  How many types of shape are there?
    notched, T-shaped, Y-shaped, branched, unbranched, antrorse, retrorse, curled, curved, wiggly, squiggly, round, flat, square, oblong, elliptical, ovoid, cuboid, spherical, egg-shaped, rod-shaped, heart-shaped, …
  How do we define them?
  How do we compare them?
  Is it worth the effort?
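
Returning briefly to the two-hierarchy compromise: the univocity prerequisite and the collapse on ‘ranges’ can be sketched as below. This is a minimal illustration under stated assumptions, not the PATO transformation itself. The function names, the dictionary layout, and the pairing of blue/dark/pale with hue/variation/saturation are all invented for the example; the slides do not say which value falls under which attribute.

```python
from collections import defaultdict


def check_univocity(ranges):
    """Return every value that is in the range of more than one attribute.

    `ranges` is a list of (attribute, value) pairs (hypothetical layout).
    An empty result means the univocity prerequisite holds.
    """
    owners = defaultdict(set)
    for attribute, value in ranges:
        owners[value].add(attribute)
    return {v: sorted(attrs) for v, attrs in owners.items() if len(attrs) > 1}


def collapse(attribute_is_a, ranges):
    """Merge value terms into the attribute hierarchy as is_a children.

    `attribute_is_a` maps each attribute term to its is_a parent.
    Refuses to collapse if any value belongs to several attributes.
    """
    violations = check_univocity(ranges)
    if violations:
        raise ValueError(f"non-univocal values: {violations}")
    merged = dict(attribute_is_a)
    for attribute, value in ranges:
        merged[value] = attribute  # the value becomes a subtype of its attribute
    return merged


# Illustrative data only; the attribute/value pairings are assumptions.
attribute_is_a = {
    "color": "attribute",
    "hue": "color",
    "saturation": "color",
    "variation": "color",
}
ranges = [("hue", "blue"), ("variation", "dark"), ("saturation", "pale")]

single = collapse(attribute_is_a, ranges)
print(single["blue"])

# The forbidden case from the slides: 'fast' applied to two attributes.
bad = ranges + [("velocity", "fast"), ("duration", "fast")]
print(check_univocity(bad))
```

With the ambiguous 'fast' present, collapse raises ValueError rather than guessing a parent, which is the point of requiring univocity first.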

Shape types need precise definitions to be useful
  Real shapes are not mathematical entities
    but mathematical definitions can help
  Axes of classification:
    Dimensionality
      2-4D (process “shapes”)
    concave vs convex
    angular vs non-angular
      number of sides
      corners
  Primitive and composed shapes
  Work with morphometrics community?

Shape likeness
  We can post-coordinate some shape types
    egg-shaped
    head-shaped
    A2-segment-shaped
  Dangers of circularity
  Only for genuine likeness (e.g. homeotic transformation)
    not “heart-shaped leaf”
  See annotation section of this presentation

Color
  Keep PATO HSV model
    but is black a color hue?
  We should allow overlapping partitions of color space
    different domains have ‘sub-terminologies’ of color
  Is color relational?
    Humans vs tetrachromatic UV-seeing animals
  Composition
    using has_part

Color hierarchy
  Physical quality
    Optical quality: a physical quality which exists in virtue of the bearer interacting with visible electromagnetic radiation
      Chromatic quality: an optical quality which exists in virtue of the bearer emitting, transmitting or reflecting visible electromagnetic radiation
        Color hue
        Color saturation
        Color variation
        Color
      Opacity: an optical quality which exists in virtue of the bearer absorbing visible electromagnetic radiation
        opaque
        translucent
        transparent

Part 2: Annotation using PATO
  Annotation scheme desiderata
  OBD Dataflow
  Proposed annotation scheme

Annotation scheme desiderata
  Rigour
  There is a subset of the scheme which is simple
  The entire scheme is expressive

It should have an unambiguous mapping to real world entities
  Even if PATO is completely unambiguous, an ill-defined annotation scheme may leave room for ambiguity
  Example:
    Annotation: E=eye, Q=red
    What does this mean?
      both eyes are red in this one fly instance
      at least one eye is red in this one fly instance
      a typical eye is red in this many-eyed spider
      both eyes are red in this one fly at some point in time
      both eyes are red in this one fly at all times
      all eyes are red in all flies in this experiment
      some eyes are red in some flies in this experiment

There should be a certain usable subset that is simple
  Rationale: MODs have limited resources
    building entry tools for simple subsets is easier
    building databases and query/search engines is easier
    curating with a less expressive formalism is easier, faster and requires less training
    MODs’ primary use case is search, for which expressivity is less useful
  Specifics
    Tools should have an (optional) simple facade
    Simple annotations should be expressible in a simple syntax that is understood by users with relatively little training
    There should be an exchange format and/or database schemas that use traditional technology as might be used in a MOD
      e.g. XML, relational tables

The scheme must be highly expressive
  Rationale
    May be required by other NCBCs (BIRN)
    May be required for cBio 200 gene list
    Will be required in future
  Specifics
    Expressive superset will be optional
      MODs can ‘pick and choose’ their subset
    Native exchange and storage format will be logic-based
      Details outwith scope of this presentation

Dataflow
  How will various kinds of phenotypic data get into OBD?
    what kinds of data suppliers will use different formalisms?
  3 scenarios… (more possible)

Example dataflow 1
  generic MOD curators annotate phenotypes using Phenote
  Annotations stored directly in MOD’s central DB
  MOD periodically submits to OBD
    e.g. using Phenote to create pheno-xml
  OBD converts pheno-xml to native logic-based formalism
  Users can query MOD directly, or OBD
    OBD will allow more expressive queries and have more data integrated

Example dataflow 2
  Non-MOD generates complex annotations and stores them locally
    e.g. BIRN group?
  Periodic submissions to OBD
    e.g. as OWL or OBO-format instance data
  OBD converts to native logic-based formalism
  Users can query OBD using more complex queries

Example dataflow 3
  cBio MOD curates 200 genes using Phenote
  Annotations may be stored outside normal MOD schema
    schema may not be expressive enough for complicated phenotypes
    TBD: up to MOD
  Periodic submissions to OBD
    Phenote can be used to submit pheno-xml, OWL or OBO
    MOD doesn’t have to worry about format
  OBD converts to native formalism
  Users can query OBD using relatively complex queries
  Is this (should it be) different from #1?

  [Diagram: dataflow into OBD. Labels: MOD A, MOD B, MOD C, Non-MOD, pheno-detailed XML file, OBD]

Proposed annotation schema
  The schema will be described informally using a simple syntax
    I use ‘E’ for entity and ‘Q’ for quality
    Pretend it is EAV if you like
      with implicit superfluous ‘A’
  The schema has (will have) a formal interpretation
    aim: database exchange and removal of ambiguities
    can be expressed using logical language
    OBD will use an internal logic-based representation

Outline of annotation schema
  ‘EAV’ or ‘EQ’ is not enough
    Fine for (very) simple subset
  Extensions:
    time
    relational qualities
    post-coordination of entity types
    count qualities
    measurements
    …

Standard case: monadic qualities
  Examples
    E=kidney, Q=hypertrophied
      autodef: a kidney which is hypertrophied
  We assume that there is more contextual data (not shown)
    e.g. genotype, environment, number of organisms in study that showed phenotype
  Interpretation (with the rest of the database record):
    all fish in this experiment with a particular genotype had a hypertrophied kidney at some …

Quantification
  long thick thoracic bristles
    2 statements
      E=thoracic bristle, Q=long
      E=thoracic bristle, Q=thick
  Default interpretation
    A typical thoracic bristle is long and thick
  Optional entity quantifiers
    EQuant={some,all,most,<percentage>,<count>}
    E=thoracic bristle, Q=long, EQuant=80%
      80% of the thoracic bristles in this one individual fly are long
  OBD internal representation

Time
  Example:
    E=brain,Q=small,during=stage
  An E which has a quality that instantiates Q during T
    E has the quality Q for some extent of time, and that extent overlaps T
  during and other temporal relations will come from the OBO Relations ontology

Relational qualities
  E.g. sensitivity
    E=eye, Q=sensitive, E2=red light

Post-coordinating entity types
  E=blood in head, Q=pooled
  Problem:
    The E may not be pre-defined (pre-coordinated, pre-composed) in the anatomy ontology
  We can post-compose a type representation (aka make a cross-product)
    E=(blood has_location(head))
  The ability to post-coordinate may not be available in the ‘simple-subset’
    can be expressed easily in pheno-xml, obo, owl, phenote (soon)
  OBD will handle all required reasoning

Pre-coordinating phenotypes
  Mammalian phenotype ontology has pre-coordinated phenotype terms
    osteoporosis
    pink fur
  OBD will be able to translate
    post-coordinated queries to annotations on pre-defined terms
    queries on pre-defined terms to post-coordinated phenotypes
  Requirement
    computable logical definitions are added to MP

Count qualities
  wingless
  polydactyly
  spermatocytes devoid of asters

Absence can never be instantiated
  wingless
    E=wing, Q=absent
    autodef: “an instance of wing which is absent”
  Proposal: restate as:
    E=mesothoracic segment, Q=missing part, E2=wing
  This has other advantages
    works better for “spermatocyte devoid of asters”

The quality of ‘being many’ does not inhere in a finger
  Polydactyly
    E=finger, Q=supernumerary
    autodef: “a finger which is supernumerary”
  Restate as:
    E=hand, Q=supernumerary parts, E2=finger
    “a hand which has more fingers as parts than is typical”
  With count extension
    E=hand, Q=supernumerary parts, E2=finger, Count=6
      could also say +1
    “a hand with 6 fingers, which is more than normal”

Proposed PATO sub-hierarchy
  part count quality
    lacking parts
      lacking all
      lacking some
    having normal part count
    having extra parts

Mass count qualities
  furriness
  porosity
  Bearers possess these qualities by virtue of the number and qualities of their granular parts
    hairiness by virtue of: number, width, length, spacing, orientation of hair-parts

What is the essence of hairy?
  Attempt 1:
    E=skin,Q=hairy
    but what if we do not have ‘hairy’ pre-coordinated in PATO?
  Alternate representation:
    E=skin,Q=excess fine-grained parts,E2=hair
    open Q: is this equivalent to, subsumed by, or related to representation 1?
  Another representation:
    E=hair, Q=long
    this is something different

Increased brown fat cells
  “increased brown fat cells”
  Attempt 1:
    E=brown fat cell, Q=increased
    autodef: a brown fat cell which is increased
  Restate as:
    E=organism, Q=increased (granular) parts, E2=brown fat cell
    works better for “increased brown fat cells in upper body”
  OBD handles reasoning
    should annotations to the above be returned for queries of PATO term “fatty”?

Relativity
  PATO has terms like
    large
    increased
  Context is implicit
    strain
    species
    genus/order
  Extension to make explicit
    In_comparison_to
      Bigger than average for species/genus/etc
    E=brain,Q=large,In_comparison_to=<taxon-id>
      default is same species as specified by genotype
  Comparative phenotypes
    E=brain,Q=large,In_comparison_to=<phenotype-id>
    requires recording phenotype IDs
    e.g. two experiments, same genotype, different environment, phenotype stronger in one

Ratio & relative_to
  Use cases:
    Size of brain relative to size of skull
    Size of brain relative to size of skull in an individual when compared to size of brain relative to size of skull in a typical individual of that species
  E=brain,Q=large,relative_to=skull, in_comparison_to=<taxon-id>
    defaults to: whole organism

Modifiers
  E=bone,Q=notched,Mod=mild
  Standardised qualitative modifiers
    Meaning dependent on E and Q
  Can have multiple, cross-cutting scales
    qualitative and numeric/score based
  [Figure: modifier scales. Word scale: absent, mildly realised, normal, strong, extreme. Score scale: 0, 0.001, 0.01, 0.1, 1, 10, 100.]

Modifiers modify meaning of Q
  Influence of Mod on Q is subjective
    but the direction is objective
  Example:
    E=adult_human_body, during=sleep
    Q={low,high} temperature, Mod=mild,normal,moderate,extreme
  [Figure: modifier scales applied to body temperature.
    abnormality: abn+, abnormal, normal, abnormal, abn+
    word scale: absent, mildly realised, normal, strong, extreme
    score scale: NOT, 0.001, 0.01, 0.1, 1, 10, 100
    temperature: N/A, 35, 37, 39
    low temperature: 37, 36.5, 36, 35
    high temperature: 37, 37.5, 38, 39]

Modifiers and PATO
  Modifiers are not qualities
  Modifiers should not be in a true ontology
  But we can still give these PATO IDs
    kept separate from core PATO ontology
  Modifiers can be relational
    relatum may be implicit
    e.g. abnormal_with_respect_to
  Modifiers serve similar purposes as Values in tripartite EAV model
  Difference:
    absent, low, high are not treated in the same way as genuine quality types like ‘notched’, ‘large’, ‘diploid’, ‘pink’
    they are ingredients in the representation language, and not types in an ontology

Heterozygous flies have very short and highly branched arista laterals.
  E=arista lateral, EQuant=all, Q=short, Mod=extreme, in_comparison_to=Dmel
  E=arista lateral, EQuant=all, Q=branched, Mod=extreme, in_comparison_to=Dmel

Measurements
  Measurements are not qualities
  In the schema, representations of measurements are attached to the representations of qualities
  Separate measurement schema
    don’t need to discuss fine grained details here
    some data providers will require more detail than others here
      e.g. averages, error bars, …
  E=tail, Q=length, Measurement=2cm
  E=tail, Q=length, Measurement=+.1cm, in_comparison_to=<individual-id>

Likeness
  Shape likeness
  Homeotic transformations
    E=A2 segment,Q=morphology,Similar_to=A3 segment
    Interp: An A2 segment with the morphological features of an A3 segment
  but not “heart-shaped leaves”

Conditionals
  Some phenotypes are only realised under certain conditions
    environment
      including chemical interactions, RNA interference etc
  we should separate conditionals (this phenotype only seen in this envirotype with this genotype) from data (on this occasion this phenotype seen in this envirotype with this genotype)

Schema elements
  Phenotype character:
    E
    Q
    EQuant
    E2
    Count
    Mod
    Relative_to
    In_comparison_to
    Similar_to
    Measurement
    Temporal
  Most of these elements are optional
    data providers pick and choose their level of …
  future extensions
    boolean combinations
    conditional statements
      e.g. environment
    modifier: ++ + . - --
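
The schema elements listed above, and the informal "E=…, Q=…" syntax used throughout these slides, can be sketched as a small Python record with a parser. This is an illustration only and is not the pheno-xml, Phenote or OBD representation. The class Annotation, the function parse_annotation and the field name During (standing in for the Temporal element, after the slides' "during=…" usage) are assumptions. The parser splits naively on commas, so forms such as Mod=mild,normal or Q={low,high} are out of scope.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Annotation:
    """Hypothetical record for one phenotype character; only E and Q are required."""
    E: str
    Q: str
    EQuant: Optional[str] = None
    E2: Optional[str] = None
    Count: Optional[str] = None
    Mod: Optional[str] = None
    Relative_to: Optional[str] = None
    In_comparison_to: Optional[str] = None
    Similar_to: Optional[str] = None
    Measurement: Optional[str] = None
    During: Optional[str] = None  # assumed name for the Temporal element

    def autodef(self) -> str:
        """Automatic definition for the standard (monadic) case only."""
        article = "an" if self.E[0].lower() in "aeiou" else "a"
        return f"{article} {self.E} which is {self.Q}"


def parse_annotation(text: str) -> Annotation:
    """Parse 'E=..., Q=..., Mod=...' into an Annotation (keys are case-insensitive)."""
    known = {f.name.lower(): f.name for f in fields(Annotation)}
    values = {}
    for part in text.split(","):
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        name = known.get(key.strip().lower())
        if name is None:
            raise ValueError(f"unknown schema element {key.strip()!r}")
        values[name] = value.strip()
    return Annotation(**values)  # raises TypeError if E or Q is missing


kidney = parse_annotation("E=kidney, Q=hypertrophied")
print(kidney.autodef())

arista = parse_annotation(
    "E=arista lateral, EQuant=all, Q=short, Mod=extreme, in_comparison_to=Dmel"
)
print(arista.In_comparison_to, arista.E2)

hand = parse_annotation("E=hand, Q=supernumerary parts, E2=finger, Count=6")
print(hand.E2, hand.Count)
```

Making every element except E and Q optional mirrors the idea that data providers can pick a simple subset while the full schema stays expressive.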