The SPECIALIST Lexicon and NLP Tools
Allen Browne, Guy Divita
National Library of Medicine
Nov. 6, 2009

The SPECIALIST Lexicon
[Diagram: Text processing, Lexical tools, SPECIALIST Lexicon]

The SPECIALIST Lexicon
• A syntactic lexicon
• Biomedical and general English
• Over 430,000 records

Lexicon Growth
[Figure: lexicon growth. Source: George A. Miller, The Science of Words, 1991]

Frequency Spectrum of Medline 2006
[Figure: frequency spectrum; y-axis V(m,N), x-axis M from 1 to 100,000,000]

Frequency Spectrum: Alice in Wonderland
[Figure: frequency spectrum. Source: Baayen, 2001]

The SPECIALIST Lexicon
• Morphology
  – Inflection
  – Derivation
• Orthography
  – Spelling variants
• Syntax
  – Complementation for verbs, nouns, and adjectives

Morphology
• Inflectional
  – nucleus, nuclei
  – cauterize, cauterizes, cauterized, cauterizing
  – red, redder, reddest
• Derivational
  – laryngeal -- larynx
  – transport -- transportation

Derivational Morphology
Dictionary+ology+is

Inflectional Morphology
octopus: octopi, octopuses

Orthography: Spelling Variation
• align -- aline
• Grave’s disease -- Graves’s disease -- Graves’ disease
• anesthetize -- anesthetise
• esophagus -- oesophagus
• foetus -- fetus
• centre -- center

Syntax -- Verb Complements
• intran
  – I’ll treat.
• tran=np
  – He treated the patient.
• ditran=np,pphr(with,np)
  – She treated the patient with the drug.
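The complement codes on the slide above (intran, tran=np, ditran=np,pphr(with,np)) pair a frame name with an optional comma-separated argument list, where an argument such as pphr(with,np) carries its own nested commas. A minimal Python sketch of splitting such a code follows; the function names are hypothetical and are not part of the SPECIALIST tools, which are distributed in Java.

```python
def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_complement(code):
    """Return (frame, arguments) for a code such as 'ditran=np,pphr(with,np)'."""
    frame, sep, rest = code.partition("=")
    return frame, (split_top_level(rest) if sep else [])


# The three codes shown on the slide.
for code in ["intran", "tran=np", "ditran=np,pphr(with,np)"]:
    print(parse_complement(code))
```

Tracking parenthesis depth keeps pphr(with,np) together as one argument, so a ditransitive code yields exactly two complements.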
Syntax -- Verb Complements
{base=treat
entry=E0061964
cat=verb
variants=reg
intran
tran=np
tran=pphr(with,np)
tran=pphr(of,np)
ditran=np,pphr(to,np)
ditran=np,pphr(with,np)
ditran=np,pphr(for,np)
cplxtran=np,advbl
nominalization=treatment|noun|E0061968
}

Lexicon Parts of Speech
[Figure: bar chart of record counts by part of speech: Noun, Adj, Verb, Adv, Prep, Pron, Conj, Det, Modal, Aux, Compl; y-axis from 0 to 350,000]

Miller -- 1991

Lexicon Unit Records
{base=chronic
entry=E0016869
cat=adj
variants=inv
position=attrib(1)
position=pred
stative
}

{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}

{base=aspirate
entry=E0010803
cat=verb
variants=reg
tran=np
nominalization=aspiration|noun|E0010804
}

{base=in
entry=E0033870
cat=prep
}

Acronyms and Abbreviations
{base=BLM
entry=E0319730
cat=noun
variants=uncount
variants=metareg
abbreviation_of=bilayer lipid membrane|E0319734
abbreviation_of=bimolecular liquid membrane|E0319733
abbreviation_of=bleomycin|E0013378
}

Orthographic vs. Lexicographic Word
Why, for instance, if a two-word boy scout feels chilly on his one-word campground, does he pull up a two-word camp chair in front of his one-word campfire? Anyone who seeks a strictly logical answer to such questions is chasing will-o'-the-wisps (chargeable in telegrams as a single word, because of the hyphens) in a semantic bog.
Louis Salomon, Semantics and Common Sense, Holt Rinehart and Winston, 1966.
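The unit records shown above share one layout: an opening brace, a series of key=value slots (some repeated, such as variants or abbreviation_of), bare flags such as intran or stative, and a closing brace. As a rough illustration, the Python sketch below reads one record into a dictionary of lists. It assumes one slot per line; parse_record and the "flags" key are names invented for this example, not part of any SPECIALIST distribution.

```python
from collections import defaultdict

SAMPLE_RECORD = """{base=BLM
entry=E0319730
cat=noun
variants=uncount
variants=metareg
abbreviation_of=bilayer lipid membrane|E0319734
abbreviation_of=bimolecular liquid membrane|E0319733
abbreviation_of=bleomycin|E0013378
}"""


def parse_record(text):
    """Parse one brace-delimited unit record into {slot: [values]}."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("record must be wrapped in braces")
    slots = defaultdict(list)
    for line in body[1:-1].splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if sep:
            slots[key].append(value)      # key=value slot, possibly repeated
        else:
            slots["flags"].append(line)   # bare flag such as intran or stative
    return dict(slots)


record = parse_record(SAMPLE_RECORD)
print(record["base"], record["variants"])
for expansion in record["abbreviation_of"]:
    # Each expansion carries the entry identifier of the expanded term.
    term, entry_id = expansion.split("|")
    print(term, entry_id)
```

Keeping every slot as a list avoids special cases for slots that may occur more than once in a record.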
UTF-8
{base=resume
spelling_variant=résumé
spelling_variant=resumé
entry=E0053099
cat=noun
variants=reg
}

{base=deja vu
spelling_variant=deja-vu
spelling_variant=déjà vu
entry=E0021340
cat=noun
variants=uncount
}

{base=role
spelling_variant=rôle
entry=E0053757
cat=noun
variants=reg
}

{base=cafe
spelling_variant=café
entry=E0420690
cat=noun
variants=reg
}

Noun Variants
{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
cat=noun
variants=uncount
variants=reg
variants=glreg
}
• Kaposi’s sarcoma
• Kaposi’s sarcomas
• Kaposi’s sarcomata
• Kaposi sarcoma
• Kaposi sarcomas
• Kaposi sarcomata

Regular Nouns
• The plural suffix is s.
• y becomes ie following a consonant before s.
• e is inserted before s if the base ends in s, z, x, ch, or sh.
• Leach – Leaches
• Stomach – Stomachs
[Diagram labels: Regular nouns, Greco-Latin, irregular]

Uncount Nouns (abstract or mass)
{base=smallpox
entry=E0056359
cat=noun
variants=uncount
}

{base=potassium
entry=E0049387
cat=noun
variants=uncount
}
• * a smallpox
• * two smallpoxes
• much smallpox
• * a potassium
• * two potassiums
• much potassium
* This form does not occur

Fixed Plural Nouns
{base=police
entry=E0048616
cat=noun
variants=plur
}

{base=scissors
entry=E0054633
cat=noun
variants=plur
}

Irregular Nouns
{base=corpus
entry=E0019113
cat=noun
variants=irreg|corpora|
variants=reg
}

{base=larynx
entry=E0036919
cat=noun
variants=irreg|larynges|
variants=reg
}

Regular Verbs
• The third person present tense suffix is s.
  – y becomes ie following a consonant before s.
  – e is inserted between s, z, x, ch, or sh and s.
• The past tense suffix is ed.
  – y becomes ie following a consonant before ed.
  – Final e is deleted before ed.
• The past participle is the same as the past tense.
• The present participle suffix is ing.
  – Final e is deleted before ing unless preceded by e, y or o.
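The regular verb rules above (suffixes s, ed, and ing, with the y-to-ie, e-insertion, and e-deletion adjustments) can be written as three short functions. The Python sketch below is only an illustration of those stated rules: the function names are hypothetical, final s is treated like z, x, ch, and sh so that dismiss yields dismisses, and doubling and irregular verbs are deliberately not handled.

```python
VOWELS = "aeiou"
SIBILANT_ENDINGS = ("s", "z", "x", "ch", "sh")


def _consonant_y(base):
    """True when the base ends in y preceded by a consonant (dry, not play)."""
    return len(base) > 1 and base.endswith("y") and base[-2] not in VOWELS


def third_person(base):
    if _consonant_y(base):
        return base[:-1] + "ies"          # y becomes ie before s
    if base.endswith(SIBILANT_ENDINGS):
        return base + "es"                # e is inserted before s
    return base + "s"


def past(base):
    """Past tense; the past participle of a regular verb is the same form."""
    if _consonant_y(base):
        return base[:-1] + "ied"          # y becomes ie before ed
    if base.endswith("e"):
        return base + "d"                 # final e is deleted before ed
    return base + "ed"


def present_participle(base):
    if len(base) > 1 and base.endswith("e") and base[-2] not in "eyo":
        return base[:-1] + "ing"          # final e is deleted before ing
    return base + "ing"


for verb in ["dismiss", "agree", "dry", "cauterize"]:
    print(verb, third_person(verb), past(verb), present_participle(verb))
```

A base marked variants=reg in the lexicon would be expected to follow these patterns; anything marked regd or irreg needs separate handling.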
Regular Verbs
• dismiss: dismisses, dismissed, dismissing
• agree: agrees, agreed, agreeing
• dry: dries, dried, drying

Regular Doubling Verbs
• End in a CVC pattern
• Double the final consonant before ed and ing.
• Are otherwise regular
• variants=regd
control: controls, controlled, controlling

Irregular Verbs
{base=bite
entry=E0013219
cat=verb
variants=irreg|bite|bites|bit|bitten|biting|
intran
tran=np
cplxtran=np,advbl
}

Ancillary Data Bases
• Synonymy – sm.db
• Derivation – dm.db, dm.rules
• Inflection – im.rules
• Neoclassical compounds – nc.db

Derivational Facts and Rules
dm.facts
treatment|noun|treat|verb
prohibition|noun|prohibitive|adj
cell lineage|noun|cell line|noun
photochemotherapeutic|adj|photochemotherapy|noun
pharmacotherapeutic|adj|pharmacotherapy|noun

Derivational Facts and Rules
dm.rules
# e.g. alienation|alienate
ation$|noun|ate|verb
ration|rate; station|state;

Inflectional Facts and Rules
im.rules
# Noun rules (glreg)
us$|noun|singular|i$|noun|plural
antus|anti;
ma$|noun|singular|mata$|noun|plural
a$|noun|singular|ae$|noun|plural
um$|noun|singular|a$|noun|plural
on$|noun|singular|a$|noun|plural
sis$|noun|singular|ses$|noun|plural
is$|noun|singular|ides$|noun|plural
men$|noun|singular|mina$|noun|plural

Neoclassical compounds
nc.db
abdomin(o)|abdomen|root
ab|away from|prefix
acanth(o)|prickle|root
acar(o)|mite|root
acetabul(o)|acetabulum|root
ad|towards|prefix
agogue|inducing|terminal
albumin(o)|albumin|root
sis|condition|terminal
stomy|surgical opening|terminal

PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS
pneu.mo.no.ul.tra.mi.cro.scop.ic.sil.i.co.vol.ca.no.co.ni.o.sis
\'n(y)u:-m*-(.)no--.*l-tr*-.mi-kr*-'ska:p-ik-'sil-i-(.)ko--(.)va:l-'ka--no--.ko--ne--'o--s*s\ n
[NL, fr. Gk pneumo-n + ISV ultramicroscopic + NL silicon + ISV volcano + Gk konis dust]
: a pneumoconiosis caused by the inhalation of very fine silicate or quartz dust
-- Merriam Webster's 3rd International Dictionary, page 1747.
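The Greco-Latin (glreg) noun rules listed under im.rules earlier each pair a singular suffix with a plural suffix. The Python sketch below applies those suffix pairs to propose candidate plurals. It is an illustration only: it ignores the exception pairs that accompany a rule in the file (such as antus|anti), the function name is hypothetical, and it makes no attempt to check candidates against the lexicon, so some outputs are not real words.

```python
# (singular suffix, plural suffix) pairs copied from the glreg noun rules.
GLREG_RULES = [
    ("us", "i"),
    ("ma", "mata"),
    ("a", "ae"),
    ("um", "a"),
    ("on", "a"),
    ("sis", "ses"),
    ("is", "ides"),
    ("men", "mina"),
]


def glreg_plural_candidates(singular):
    """Return every plural produced by a rule whose suffix matches."""
    candidates = []
    for old, new in GLREG_RULES:
        if singular.endswith(old):
            plural = singular[: len(singular) - len(old)] + new
            if plural not in candidates:
                candidates.append(plural)
    return candidates


for noun in ["nucleus", "sarcoma", "bacterium", "diagnosis", "foramen"]:
    print(noun, glreg_plural_candidates(noun))
```

More than one rule can match a word (sarcoma ends in both ma and a), which is why the function returns a list rather than a single form; choosing among candidates would need lexicon data.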
The Protein of a tobacco mosaic virus, Dahlemense strain
acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutami
nylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophy
lalanylaspartylprolylisoleucylglutamylleucylleucyllasparaginylv
alylcysteinylthreonylserylserylleucylglycllasparaginylglutaminy
lphenylalanylglutaminylthreonylglutaminylglutaminylalanylargi
nylthreonylthreonylglutaminylvalylglutaminylglutaminylphenyla
lanylserylglutaminylvalyltryptophyllysylprolylphenylalanylprolyl
glutaminylserylthreonylvalylarginylphenylalanylprolylglycylasp
artylvalyltyrosyllsyslvalyltyrosylarginyltyrosylasparaginylalanyl
valylleucylaspartylprolylleucylisoleucylthreonylalanylleucylleuc
ylglycylthryonylphenylalanylaspartylthreonylarginylasparaginyl
arginylisoleucylisoleucylglutamylvalylglutamylasparaginylgluta
minylglutaminylserylprolylthreonylthreonylalanylglutamylthreo
nylleucylaspartylalanylthreonylarginylarginylvalylaspartylaspar
tylalanylthreonylvalylalanylisoleucylarginylserylalanylasparagi
nylisoleucylasparaginylleucylvallasparaginylglutamylleucylvaly
larginylglycylthreonylglycylleucultyrosylasparaginylglutaminyla
sparaginylthreonylphenylalanylglutamylserylmethionylserylgly
cylleucylvalyltryptophylthreonylserylalanylprolylalanylserine

Synonyms
sm.db
alar|adj|wing|noun
amygdaline|adj|tonsil|noun
articular|adj|joint|noun
bulbar|adj|medulla oblongata|noun
furuncular|adj|boil|noun
genicular|adj|knee|noun
hepatocellular|adj|liver cells|noun
lazar|adj|leprosy|noun
lenticular|adj|crystalline lens|noun
ypsiliform|adj|upsiloid|adj
wolfram|noun|tungsten|noun
double vision|noun|diplopia|noun

[Diagram: Text processing, Lexical tools, SPECIALIST Lexicon]

Lexical Tools
• Wordind -- breaks strings into words
  – Produces the Metathesaurus word indexes (MRXW)
• LVG -- performs various lexical transformations
• NORM -- a selection of LVG transformations
  – Used for Metathesaurus indexing
  – Produces the Metathesaurus normalized word and string indexes (MRXNW & MRXNS)
  – Used to access those indexes

Normalization
• Hodgkin Disease
• HODGKINS DISEASE
• Hodgkin's Disease
• Disease, Hodgkin's
• HODGKIN'S DISEASE
• Hodgkin's disease
• Hodgkins Disease
• Hodgkin's disease NOS
• Hodgkin's disease, NOS
• Disease, Hodgkins
• Diseases, Hodgkins
• Hodgkins Diseases
• Hodgkins disease
• hodgkin's disease
• Disease;Hodgkins
• Disease, Hodgkin
All of these normalize to:
• disease hodgkin

SPECIALIST NLP Tools
• Tokenizers
  – Sentence, Section, Phrases, Words
• Term variant lookup
• Part of Speech Tagger
• Index Maker

The Lexical Systems Group
• Allen Browne: browne@nlm.nih.gov
• Guy Divita: divita@nlm.nih.gov
• Chris Lu: lu@nlm.nih.gov

SPECIALIST NLP Tools
Lister Hill National Center For Biomedical Communications
National Library of Medicine
Guy Divita
Fall 2009

SPECIALIST NLP Tools
SPECIALIST.nlm.nih.gov
Tools
• The Lexicon
• Document Tokenization Tools
• Lexicon Term Lookup
• POS Tagger
• Term Manipulation Tools
• Spelling Suggestion
• Visual Annotation Tool
• Text Categorization Tool

Term Based Tools

SPECIALIST Lexical Tools (Java)
Utilities to build smarter indexes

SPECIALIST Lexical Tools (Java)
• 56 Term transformations
[Diagram: the term "treat" with generated variants. Transformation labels: inflections, nominalizations, derivations, combinations. Variants shown: treats, treating, treated, treatment, treatments, treatability, treaty, treatable, treater]

SPECIALIST Lexical Tools (Java)
[Diagram: the term "color" with generated variants. Transformation labels: inflections, spelling variants, nominalizations, derivations, synonyms, combinations. Variants shown: colour, coloring, colored, colors, chromaticities, colorlessness, Chromaticness, chromatic, colorless, colorant, colorful]

SPECIALIST Lexical Tools (Java)
[Diagram: generated variants of an ambiguous short form. Transformation labels: inflections, acronyms, acronym expansions, nominalizations, synonyms, derivations, combinations. Variants shown: seconds, seconded, serous, Ser, SOR, secant, secondarily, second, s’s, sec, secondly, secondary, s]

SPECIALIST Lexical Tools (Java)
The tools can be arranged so that the output of one is the input to another.
Example of a quick and dirty normalization:
[Flow diagram from Input term to Output term, with the steps: lowercase, strip diacritics, remove possessive, remove stop words, strip punctuation, word order sort]

SPECIALIST Lexical Tools: Norm (Java)
• remove genitives
• replace punctuation with spaces
• remove stop words
• lowercase
• uninflect each word
• spelling variants
• word order sort

SPECIALIST Lexical Tools: Norm (Java)
The Hodgkin disease variants listed under Normalization hash into a class of lexically similar terms:
• disease hodgkin

Spelling Retrieval Tools
• GSpell
  – A term retrieval tool
  – N-gram nearest neighbor algorithm
  – MetaPhone phonetic spelling normalization
  – Homophones
  – Common misspellings
  – Candidates sorted by edit distance and by frequency of occurrence in a corpus
• Build Your Own
  – Custom crafted dictionaries are key to spelling suggestion

dTagger
• Assigns Parts of Speech (POS) to words in text
• NP parsers need terms with Parts of Speech assigned to determine phrase breaks and head assignment
[Figure legend: noun, adj/adv, verb, conj, det, prep, aux/modal]

Document Based Tools

SPECIALIST Text Tools
Tokenizes text into
– Sections
– Sentences
– Phrases
– Terms
– Words
– Lexicon Entries
Document formats
– Medline
– HL7
– Free text

Example document (regions labeled Title, Auth, and Abstract on the slide):

SPME determination of volatile aldehydes for evaluation of in-vitro antioxidant activity

Elena E. Stashenko, Miguel A. Puertas, Jairo R. Martínez
A1 Chromatography Laboratory, Research Center for Biomolecules, School of Sciences, Industrial University of Santander. A.A. 678, Bucaramanga, Colombia

Abstract: The in-vitro antioxidant activity of natural (essential oils, vitamin E) or synthetic substances (tert-butyl hydroxy anisole (BHA), Trolox) has been evaluated by monitoring volatile carbonyl compounds released in model lipid systems subjected to peroxidation. The procedure employed methodology previously developed for the determination of carbonyl compounds as their pentafluorophenylhydrazine derivatives which were quantified, with high sensitivity, by means of capillary gas chromatography with electron-capture detection. Linoleic acid and sunflower oil were used as model lipid systems. Lipid peroxidation was induced in linoleic acid by the Fe2+ ion (1 mmol L-1, 37 °C, 12 h) and in sunflower oil by heating in the presence of O2 (220 °C, 2 h).
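Word-level tokenization of a document like the one above is easier to build on when every token keeps its character offset and length, so that later annotations can point back into the original text. The Python sketch below is one simple way to do that with a regular expression; it is not the Text Tools tokenizer (which is a Java component), and the pattern and function name are assumptions made for illustration.

```python
import re

# A run of word characters, or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_with_offsets(text):
    """Return (offset, size, token) triples in document order."""
    return [
        (match.start(), len(match.group()), match.group())
        for match in TOKEN_PATTERN.finditer(text)
    ]


sample = "The in-vitro antioxidant activity of natural (essential oils"
for offset, size, token in tokenize_with_offsets(sample):
    print(f"{offset}|{size}|{token}")
```

Splitting the hyphen and the parenthesis into their own tokens means each piece can later receive its own tag, while the offsets still allow the original string to be reassembled exactly.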
Processing steps
• Word Tokenizer
• Term Tokenizer
• POS tagger
• Phrase Chunker
• Phrase Variant Generation

Document Container (Java)
• Document
• Section
• Sentence
• Token

Text Annotation

Text Annotation (2)
Simple Format
Offset|Size|Tag|SubTag|Annotation|..
 0|  0|BOS | | |
 0|  3|det | | |The
 4|  2|adj | | |in
 6|  1|adj | | |-|
 7|  5|adj | | |vitro|
13| 11|noun | | |antioxidant|
25|  8|noun | | |activity|
34|  2|prep | | |of|
37|  7|adj | | |natural|
45|  1|lp | | |(|
46|  9|noun | | |essential
56|  4|noun | | |oils|

Text Categorization
• A set of tools for:
  – Text categorization
  – Indexing & retrieval
  – Document classification
  – Word sense disambiguation
  – etc.
• Based on JD Indexing (Susanne Humphrey)
  – Vector / cosine coefficient method
  – Unsupervised
  – Uses the pre-existing assignment of Journal Descriptors to Medline abstracts
  – High performance

Text Categorization
• Command line tools
  – JDI (Journal Descriptor Indexing)
  – STI (Semantic Type Indexing)
  – STRI (Semantic Type Real-Time Indexing)
  – MLT (MEDLINE Tokenizer)
  – STWSD (ST Word Sense Disambiguation)
• Web Tools
• Java APIs

MetaMap Transfer (MMTx)
• Extracts UMLS concepts from text
• Java Implementation of MetaMap
Example input:
Retinoblastoma
What is retinoblastoma? Retinoblastoma is a rare type of eye cancer that develops in the retina, which is the part of the eye that detects light and color. Although this disorder can occur at any age, it usually develops in young children.
Example output:
Meta Mapping (1000): C0496836 (Malignant neoplasm of eye, unspecified) [Neoplastic Process]

MMTx
Why would you want to use it?
[Figure: a block of medical text]

De-Identification
Items to remove:
• Dates
• Names
• Addresses
• Phone No’s
• Age > 90
• Alphanumeric identifiers
[Example: the sample document shown under SPECIALIST Text Tools (SPME determination of volatile aldehydes ..., Stashenko et al.) stands in for a patient record; slide labels: P3I, Patient Record]

De-Identification (2)
• Term Tokenizer
• POS tagger
• Name Recognition
• Address Recognition
• Human Edit and Review (Annotation Tool)
• Redaction
• Transform back to Original Format
[Example: the same sample document, shown again]

To Do List
• Patient De-identification
• Text Tools, GSpell, dTagger 2010 distribution to include:
  – Using the 2010 Lexicon
  – Updated to Java 1.6 (Generics)
  – Berkeley Java DB replaced with HyperSQL
  – dTagger integrated with Annotation Tool
  – Eclipse Projects

Resources
• SPECIALIST NLP Tools: http://SPECIALIST.nlm.nih.gov
• Presentations, Tutorials and Documentation: http://lexsrv3.nlm.nih.gov/SPECIALIST/docs
• Lexicon Technical Document: http://SPECIALIST.nlm.nih.gov/technicalReport.pdf

Contacts
• General Questions: umlslex@nlm.nih.gov
• Allen Browne: browne@nlm.nih.gov
• Guy Divita: divita@nlm.nih.gov
• Chris Lu: lu@nlm.nih.gov