SUPPLEMENTARY MATERIAL: ANNOTATION GUIDELINES

1. Concepts with a Concept Unique Identifier (CUI) in the Unified Medical Language System (UMLS) should only be annotated if they belong to the Mantra terminology. This means that a concept should only be annotated if it belongs to MeSH, MedDRA, or SNOMED-CT, and if it has a semantic type that is part of one of the following semantic groups: Anatomy (ANAT), Chemicals and drugs (CHEM), Devices (DEVI), Disorders (DISO), Geographic areas (GEOG), Living beings (LIVB), Objects (OBJC), Phenomena (PHEN), Physiology (PHYS), Procedures (PROC). Concepts belonging to other vocabularies or semantic groups should not be annotated. Information about the mapping of semantic types to semantic groups can be found at http://semanticnetwork.nlm.nih.gov/SemGroups/.

2. Annotations from a silver standard corpus (SSC) derived from a number of indexing systems will be provided as pre-annotations of concepts. A pre-annotation consists of the span of text corresponding to the concept, together with the concept's preferred name, semantic type (possibly more than one), semantic group, and CUI. Different pre-annotations may be provided for the same span of text.

3. To find further information on a span of text (pre-annotated or selected by the annotator), annotators can link out to the UMLS Terminology Services (https://uts.nlm.nih.gov/home.html) or to the Mantra terminology through the Search field in the brat Edit annotation or New annotation screens by clicking the UMLS or Mantra link, respectively.

4. Annotators have to check that each pre-annotation is correct, i.e., they should verify that the preferred name, semantic type and group, and CUI of the pre-annotation match the term in the text. Definitions of many (but not all) concepts can be found in the UMLS Terminology Services. The context of the annotation can be used to assess the correct meaning.
Examples of wrongly pre-annotated concepts (underlined):

“CMV attacks the retina …” => “attacks” has been pre-annotated as C1304680 (preferred term (PT) “attack”, type “finding”, group DISO), which is not correct. The annotation should be removed.

“… in patients with normal coronary angiography” => “normal” (PT “skin appearance normal”, type “finding”, group DISO, C0558145) is incorrectly annotated. The annotation should be removed.

“… may affect your ability to drive …” => “drive” (PT “drive”, type “mental process”, group PHYS, C0013126) is incorrectly annotated and should be removed. Note that MeSH contains the concept “automobile driving” (C0004379), but it has type “daily or recreational activity” (group ACTI), and thus should not be annotated.

5. In case of multiple pre-annotations of the same span of text, the annotators should try to disambiguate, using contextual information and information about the pre-annotated concepts (PT, type, group, and concept definition if available). If the difference in meaning between the concepts is not clear, or the context provides insufficient information to disambiguate, all annotations are kept. If several annotations are applicable but one annotation is more specific than another, only the most specific annotation should be kept.

Examples of multiple pre-annotated concepts (underlined):

“… can lead not only to pain …” => “pain” is pre-annotated as C0030193 (PT “pain”, type “sign or symptom”, group DISO) and C0242936 (PT “pain clinics”, type “manufactured object”, group OBJC). The latter annotation is incorrect and should be removed.

“… retina of the eye …” => “eye” has two annotations: C0015392 (PT “eye”) and C1280202 (PT “entire eye”). Both concepts belong to type “body part, organ, or organ component”, group ANAT. The distinction between the concepts is not clear and both annotations should be kept.
“Breast-feeding.” => “breast-feeding” has two annotations: C1623040 (PT “breast-feeding (mother)”, type “finding”, group DISO) and C0006147 (PT “breast feeding”, type “organism function”, group PHYS). Since there is no context information to disambiguate between the concepts, both annotations should be kept.

“Neupro 4 mg/24 h transdermal patch. Each patch releases …” => “transdermal patch” is correctly pre-annotated as C0991556 (PT “transdermal patch”). The second occurrence of “patch” has multiple pre-annotations, including C1305400 (“surgical patch”), C1707974 (“extended-release film”), and C0991556 (“transdermal patch”). Based on contextual information, only the last annotation is kept.

“1 ml of solution contains 40 micrograms travoprost and 5 mg timolol” => “solution” has two pre-annotations: C0037633 (PT “solutions”, type “substance”, group OBJC) and C0525069 (PT “pharmaceutical solutions”, type “biomedical or dental material”, group CHEM). Although both annotations are applicable, the latter is the more specific annotation, and the former should be removed.

6. When a concept is nested within another concept, annotate the most detailed description of the concept. The general principle is to annotate the concept that is more specific and informative. If the more specific concept is not contained in the Mantra terminology, annotate the less specific concept or concepts. If a concept overlaps with another concept, annotate both concepts.

Examples of nested or overlapping concepts:

“Exercised-induced asthma …” => Both “exercised-induced asthma” and “asthma” have been annotated. Since only the more specific and informative concept should be annotated, the annotation for “asthma” is removed.

“Two cases of subcutaneous panniculitis-like T-cell lymphoma …” => “subcutaneous panniculitis-like T-cell lymphoma” has been annotated, as well as “panniculitis”, “cell”, and “lymphoma”.
Only the annotation for “subcutaneous panniculitis-like T-cell lymphoma” should be kept, as it is the most specific and informative.

“Musculoskeletal tumors: …” => “tumors” has been pre-annotated (PT “neoplasms”, type “neoplastic process”, C0027651), since the more specific concept “musculoskeletal tumors” (PT “malignant neoplasm musculoskeletal”, type “neoplastic process”, C0036210) is not part of MeSH, MedDRA, or SNOMED-CT and is thus not contained in the Mantra terminology.

“This results in smooth muscle relaxation and inflow of …” => The concepts “smooth muscle” (C1267092) and “muscle relaxation” (C0026836) have both been pre-annotated. Since the more specific concept “smooth muscle relaxation” does not exist in the UMLS, both annotations are kept.

7. If a concept consists of two discontiguous spans of text, the annotator should mark the related text spans (using the Add Frag. feature in the brat Edit annotation or New annotation screens) and assign the corresponding CUI.

Examples of concepts that consist of fragmented text spans:

“Patients with renal or hepatic impairment …” => “renal” (PT “kidney”, group ANAT, C0022646) and “hepatic impairment” (PT “hepatic impairment”, group DISO, C0948807) have been pre-annotated. The former should be replaced by an annotation of the fragments “renal” and “impairment” (PT “renal impairment”, group DISO, C0341697).

“… sympathetic middle cervical ganglion …” => The fragments “sympathetic” and “cervical ganglion” should be annotated with the single concept C0446846 (PT “Cervical sympathetic ganglion”, group ANAT); “middle cervical ganglion” should be annotated separately, as C1281049 (PT “Entire middle cervical ganglion”, group ANAT) and C0228999 (PT “Structure of middle cervical ganglion”, group ANAT).
“… Chinese Hamster Ovary (CHO) cells.” => The concept C0085080 (PT “Chinese hamster ovary cell”) should be annotated twice, first by annotation of the fragments “Chinese Hamster Ovary” and “cells”, and second by annotation of the fragments “CHO” (without parentheses) and “cells”.

8. If a concept was not pre-annotated, the annotator should indicate the boundaries of the concept and its CUI (specified as “C” followed by seven digits in the Notes field of the brat New annotation or Edit annotation screens). Misspelled terms are also annotated.

Examples of missed annotations:

“… tablets are packed in unit dose blisters in packs of …” => “blisters” has wrongly been pre-annotated as C0005758 (PT “blister”, type “pathologic function”, group DISO) and C0344311 (PT “blistering eruption”, type “disease or syndrome”, group DISO). These annotations should be removed and the annotation C1319688 (PT “blister – unit of product usage”, type “biomedical or dental material”, group CHEM) should be added.

“Malignant skin tumours.” => “malignant” (C1306459) and “skin tumours” (C0037286) have been pre-annotated as two separate concepts. They should be removed and the whole term should be annotated as C0007114 (PT “malignant neoplasm of skin”).

“…, Sjorgen’s syndrome, …” => The term “Sjorgen’s syndrome” is misspelled and has not been pre-annotated. An annotation with the concept C1527336 (PT “Sjogren’s syndrome”) should be added.

9. The annotator should annotate a subword, i.e., a part of a word, if the subword is contained in the Mantra terminology and the full word is not.

Examples of subword annotations:

“lumbaalzak” (Dutch for “lumbar sac”) => There is no concept corresponding to this term in the Mantra terminology (nor in the UMLS). In English, the annotator should annotate “lumbar” (C0024090, PT “Lumbar region”, group ANAT). In Dutch, the subword “lumbaal” should be annotated with the same CUI.
“Penisgewebe” (German for “penile tissues”) => The subword “Penis” should be annotated with two concepts, C0030851 (PT “Penis”, group ANAT) and C1280739 (PT “Entire penis”, group ANAT); the subword “gewebe” should be annotated with C0040300 (PT “Body tissue”, group ANAT).
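
The mechanical parts of guidelines 1, 6, and 8 can be illustrated with a minimal Python sketch. It is not part of the annotation tooling. All names in it (`Annotation`, `is_valid_cui`, `is_annotatable`, `drop_nested`, `select_annotations`) are hypothetical, and the vocabulary label "OTHER" is a placeholder for any source outside MeSH, MedDRA, and SNOMED-CT. The sketch assumes that strict span containment is an acceptable stand-in for "less specific"; in practice that decision is made by the annotator, as are the disambiguation decisions of guidelines 4 and 5.

```python
import re
from dataclasses import dataclass

# Vocabularies and semantic groups listed in guideline 1.
MANTRA_VOCABULARIES = {"MeSH", "MedDRA", "SNOMED-CT"}
MANTRA_GROUPS = {"ANAT", "CHEM", "DEVI", "DISO", "GEOG",
                 "LIVB", "OBJC", "PHEN", "PHYS", "PROC"}

# Guideline 8: a CUI is "C" followed by seven digits.
CUI_PATTERN = re.compile(r"C\d{7}")


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one contiguous (pre-)annotation."""
    start: int               # character offset where the span begins
    end: int                 # character offset just past the span
    cui: str
    group: str               # semantic group abbreviation, e.g. "DISO"
    vocabularies: frozenset  # source vocabularies that contain the concept


def is_valid_cui(cui):
    """Check the CUI format described in guideline 8."""
    return CUI_PATTERN.fullmatch(cui) is not None


def is_annotatable(ann):
    """Guideline 1: allowed vocabulary and allowed semantic group."""
    return (is_valid_cui(ann.cui)
            and ann.group in MANTRA_GROUPS
            and bool(ann.vocabularies & MANTRA_VOCABULARIES))


def drop_nested(annotations):
    """Guideline 6 (approximation): drop an annotation whose span lies
    strictly inside a longer span; overlapping or identical spans stay."""
    kept = []
    for a in annotations:
        nested = any(
            b.start <= a.start and a.end <= b.end
            and (b.end - b.start) > (a.end - a.start)
            for b in annotations
        )
        if not nested:
            kept.append(a)
    return kept


def select_annotations(candidates):
    """Filter by guideline 1 first, so that a nested concept survives
    when the enclosing concept is not in the Mantra terminology."""
    eligible = [a for a in candidates if is_annotatable(a)]
    return drop_nested(eligible)


# "Musculoskeletal tumors": the longer concept is outside the Mantra
# vocabularies, so the nested "tumors" annotation is the one retained.
# The vocabulary assignments and groups here are illustrative assumptions.
candidates = [
    Annotation(0, 22, "C0036210", "DISO", frozenset({"OTHER"})),
    Annotation(16, 22, "C0027651", "DISO", frozenset({"MeSH"})),
]
kept = select_annotations(candidates)
print([a.cui for a in kept])  # ['C0027651']
```

Filtering before removing nested spans mirrors the order implied by guideline 6: the less specific concept is only dropped when a more specific one is actually available in the Mantra terminology. Discontiguous concepts (guideline 7) would need a list of fragments per annotation, which this sketch leaves out.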