BIOINFORMATICS Vol. 21 Suppl. 2 2005, Supplementary Material
doi:10.1093/bioinformatics/bti1142

Text Mining

Supplementary Material: Implementing the iHOP Concept for Navigation of Biomedical Literature

Robert Hoffmann*,† and Alfonso Valencia
National Center of Biotechnology, CNB-CSIC. Campus de la UAM. Madrid E-28049, Spain.
Received on March 15, 2005; revised on May 28, 2005.

This supplementary material gives details on issues that are not fully discussed in the main article. It introduces no concepts beyond those of the main article.

1. METHODS

1.1 Gene Synonym Identification in Biomedical Text

Gene synonym dictionary

Data for the initial gene synonym dictionary were collected from publicly available databases, e.g. LocusLink (Pruitt et al., 2000), FlyBase, or UniProt (Apweiler et al., 2004). In addition to primary gene symbols and names, this initial dictionary also contained alternative synonyms and orthographic variants where known. Although genome databases were the principal sources of primary symbols and names, manually curated resources of synonyms, e.g. the HUGO Nomenclature Committee (White et al., 1997), are also of great importance, since genes that were discovered and described by different laboratories often have various synonyms.

Synonyms from this initial dictionary were then altered using an iterative process to account for orthographic variations. The heuristics for these alterations were derived manually and, besides general rules (e.g. expanding ‘LAIR1’ to ‘LAIR-1’), also contained organism-specific rules (e.g. converting ‘AP*’ to ‘APETALA*’ in Arabidopsis thaliana).
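The iterative, rule-based expansion described above can be illustrated with a minimal Python sketch. Only the two example rules quoted in the text are taken from the source; the function names, the regular expressions, the reading of ‘*’ as a run of digits, and the fixed-point loop are assumptions made for illustration and are not the authors' actual heuristics.

```python
import re

def split_letter_digit(synonym):
    """General rule (from the text's example): 'LAIR1' -> 'LAIR-1'."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", synonym)
    return [f"{m.group(1)}-{m.group(2)}"] if m else []

def expand_apetala(synonym):
    """Organism-specific rule for Arabidopsis thaliana: 'AP*' -> 'APETALA*'.
    Assumption: '*' stands for a run of digits."""
    m = re.fullmatch(r"AP(\d+)", synonym)
    return [f"APETALA{m.group(1)}"] if m else []

# Hypothetical rule registry; the real rule set is not given in the text.
GENERAL_RULES = [split_letter_digit]
ORGANISM_RULES = {"Arabidopsis thaliana": [expand_apetala]}

def expand_synonyms(synonyms, organism):
    """Apply all applicable rules repeatedly until no new variant appears."""
    rules = GENERAL_RULES + ORGANISM_RULES.get(organism, [])
    known = set(synonyms)
    frontier = set(synonyms)
    while frontier:
        new_variants = set()
        for synonym in frontier:
            for rule in rules:
                for variant in rule(synonym):
                    if variant not in known:
                        new_variants.add(variant)
        known |= new_variants
        frontier = new_variants  # only freshly created variants are re-expanded
    return known
```

Re-expanding only the newly created variants keeps the loop finite as long as the rules cannot generate variants indefinitely, and it lets rules build on each other: in this sketch, ‘AP1’ yields ‘APETALA1’, which in turn yields ‘APETALA-1’.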
As a result of this iterative extension, the total number of synonyms grew from half a million to 3.2 million, and the average number of synonyms per gene increased from 3 to 19. Pre-generating the complete list of synonyms makes it possible to assess the quality (or uniqueness) of each synonym, according to the number of distinguishing characteristics, before using it in the identification process. Depending on the quality of a synonym, either its occurrence alone was sufficient to associate the corresponding gene with a text, or additional evidence was necessary (e.g. a second synonym or parts of the name). Moreover, every synonym carries search criteria that define how precisely it must match and where in a sentence it may be found (e.g. case-sensitive, not at the beginning of a sentence).

Evaluation of gene synonym qualities

Before synonyms are used for the identification process, their quality (or uniqueness) is assessed according to the number of distinguishing characteristics and classified as good, weak, very weak, or bad (Table 1).

* To whom correspondence should be addressed.
† Present address: Memorial Sloan-Kettering Cancer Center, MSKCC, New York, NY 10021, USA.

Table 1. Features and criteria for defining synonym qualities

Synonym quality criteria
1. Length in characters
2. Length in terms
3. Single or multiple term composition
4. Character composition (digits, upper and lower case, special characters)
5. Collision with stop words (e.g. the, and, or, etc.)
6. Collision with common English words
7. Collision with other biological entities (e.g. chemical abbreviations)
8. Collision with other gene synonyms (ambiguities within and between organisms)
9. Original synonym or derivative synonym

Creating the Gene Article Index

The management of about 3.2 million synonyms (as well as their quality attributes, search criteria, organisms of origin, etc.) is extremely expensive in terms of RAM.
Therefore, an initial raw association of genes to abstracts is carried out without resolving ambiguities in detail. All synonyms were reduced to single-word baits, which were then searched case-insensitively using hash code comparisons (Pieprzyk and Sadeghiyan, 1993) instead of character-by-character comparisons or regular expressions. In a subsequent step, genes were permanently assigned to whole abstracts, taking all contextual information within abstracts into consideration (e.g. the organisms mentioned in or assigned to the text, predefined negative contexts, etc.). Considering this contextual information makes it possible to account for cases in which, for example, no high-quality synonym is found, but two weak synonyms are present. Although complete abstracts were considered, sentence boundaries and sentence structures (e.g. brackets) were taken into account; a synonym, for example, must not span two sentences. Furthermore, acronyms found in the text corpus were resolved to detect conflicts between gene synonyms and other biological or medical entities (e.g. diseases, methods, etc.).

After all genes had been assigned to abstracts, they were assigned to precise positions in the text. This is especially important for overlapping synonyms, in other words, synonyms that start with the same term but differ in length (e.g. ‘erythropoietin’ and ‘erythropoietin receptor’), and for synonyms that differ only in their use of upper and lower case.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

1.2 Assessment of MeSH and Gene Cluster Contents

To make large text resources navigable it is essential to organize the literature into clusters of similar size and similar information content.
To get an idea of the specificity and content of gene clusters, we compared them to clusters based on MeSH terms. MeSH is the thesaurus of the National Library of Medicine (Kim et al., 2001) and was designed as a fast-access classification for PubMed. Here, a gene cluster is defined as all abstracts that mention a specific gene; a MeSH cluster is defined as all abstracts that have a certain MeSH term associated. To estimate a cluster’s content, we compared the frequencies of MeSH terms in the documents of that cluster with their frequencies in the background dictionary. The background dictionary was constructed from about 4000 different MeSH terms (covering anatomy, diseases, physical and biological science, chemicals, and drugs) previously associated with all PubMed abstracts. The probability ($P_T$) of finding a term ($T$) the observed number of times ($k$) in a document cluster ($C$) was then calculated for all clusters from the binomial distribution, given the known background frequency ($p$) and the total number of terms within a cluster ($n$):

$$P_T(k \mid n, p) = \binom{n}{k}\, p^{k}\, q^{\,n-k}$$

where $n \in \mathbb{N}$ is the number of terms assigned to a document cluster ($C$); $k = 0, 1, \ldots, n$ is the number of occurrences of term ($T$) within the cluster ($C$); $p = P_T(X = 1)$ is the relative frequency of term ($T$) in the background dictionary; and $q = P_T(X = 0) = 1 - p$. The binomial coefficient and the factorial are defined as

$$\binom{n}{k} := \begin{cases} \dfrac{n!}{k!\,(n-k)!} & \text{when } 0 \le k \le n \\[1ex] 0 & \text{when } 0 \le n < k \end{cases} \qquad \text{where } n, k \in \mathbb{N}$$

$$n! := \begin{cases} 1 \cdot 2 \cdot 3 \cdots n & \text{when } n \le s \\[1ex] n^{n}\, e^{-n} \sqrt{2 \pi n} & \text{when } n > s \end{cases} \qquad \text{where } n \in \mathbb{N}$$

and $n!$ was estimated using Stirling’s approximation for large $n$ ($s := 100$). For a reference on Stirling’s approximation see (Knuth, 1997).

To avoid floating point errors, the natural logarithm of the probability was calculated ($Z_T = \ln P_T$). Consequently, for a given cluster, the smaller the value of $Z_T$, the more specific the term. To keep the general comparison between MeSH and gene clusters straightforward, we considered for each cluster only the most significant term and its corresponding category, and only clusters of a user-manageable size of fewer than 200 articles. Table 7 lists the most significant terms in gene clusters. We found that gene and MeSH clusters cover most domains to a comparable extent (Figure S1), with the expected exception that MeSH clusters show an emphasis on diseases, whilst gene clusters are focused more strongly on molecular aspects, e.g. genomic imprinting, regulation, oxidative stress.

[Figure S1: bar chart. Vertical axis: term categories (Diagnosis, Therapy; Anatomy; Chemicals and Drugs; Diseases; Biological Sciences). Horizontal axis: relative number of clusters (%), 0 to 60. Bars clustered by Genes and MeSH Terms.]

Fig. S1. Comparison of gene and MeSH cluster contents. The most frequent term categories are listed on the vertical axis. In general, gene and MeSH clusters cover most domains to a comparable extent, except for the prevalence of diseases in MeSH clusters and of molecular aspects in gene clusters.

Genes and proteins are not distinguished in the iHOP system because it is nearly impossible to detect whether an author refers to a gene or a gene product; clear nomenclature guidelines to separate both concepts are missing or not used. However, we do not expect a separation of genes and proteins to enhance the navigability of the final network.

1.3 Sentence Ranking

The basic concept of navigation was enhanced by weighting sentences according to simple features and statistical parameters, so that the probability of finding relevant information first is increased (Table 2).

Table 2. Sentence ranking criteria

Sentence ranking criteria
1. Existence/number of associative verbs (in a gene-verb-gene pattern) per sentence (e.g. ‘bind’, ‘phosphorylate’, etc.)
2. Number of genes per sentence (to avoid non-specific relationships and lists of genes)
3. Length of sentence (the shorter the sentence, the more precise its information)
4. Existence of experimental evidence for certain gene-gene associations
5. Statistical significance of the genes in a sentence; avg. Z-Score ($\ln P_T$)

$$P_T(k \mid n, p) = \binom{n}{k}\, p^{k}\, q^{\,n-k}$$

where $n \in \mathbb{N}$ is the number of other genes occurring in documents about the gene in question ($G$); $k = 0, 1, \ldots, n$ is the number of occurrences of gene $X$ within documents about gene $G$; and $p = P_T(X = 1)$ is the relative frequency of gene $X$ in PubMed (background).

[Figure S2: relational database schema diagram. Tables shown: lnk_pm__pm_mh, pm_mh_tree, lnk_unipub__pm_mh, pm_mh, lnk_pm__pm_mh__pm_sh, freq_pm_mh_year, raw_lnk_unipub__entity, freq_pm_mh, pm_sh, sentence, lnk_pm_mh__synonym, country, item_association, unipub, language, location_chr, lnk_unipub__translocation, aberration_chr, lnk_unipub__organism, pm_author, pubmed, journal, lnk_pm__pm_author_subject, lnk_pm__pm_author, lnk_unipub__gene, pm_type, lnk_mm_location_chr__gene, synonym, lnk_gene__synonym, lnk_organism__synonym, organism, entity_synonym, freq_gene, lnk_synonym__island, gene, homologue_gene, stopword, lnk_gene__genbank, island, word_name, lnk_pm__foreign_key, lnk_gene__foreign_key, foreign_key, freq_island, lnk_island__island_iso, island_iso, experimental_data, exp_gene_gene_data, pm_rn, lnk_pm__pm_rn, freq_island_iso.]

Fig. S2. Relational Database Schema. Each box in this schema represents a table in the relational database. Two central tables stand out in this schema: the publication table (unipub) and the gene table (gene), which are connected through the gene-article index in the lnk_unipub__gene table. Additional information is arranged around these two central concepts: MeSH terms (pm_mh), organisms (organism), chromosome aberrations (aberration_chr), a simple English dictionary (word_name) and references to external databases (foreign_key).

1.4 Associative Verbs

It is known that about 90% of all active relations between proteins in the literature are expressed syntactically as “protein verb protein” (Blaschke and Valencia, 2001). In the current implementation of iHOP, verbs that describe interactions between proteins (e.g.
‘bind’, ‘phosphorylate’, ‘inhibit’, ‘activate’, etc.) and occur between two proteins can be highlighted to make relevant information easier to spot. Sentence boundaries and brackets are taken into account in the pattern scanning; both proteins must occur in the same sentence or in the same bracket. This simple syntax covers most active relations; however, other syntaxes could be included in future developments. Furthermore, the occurrence of these patterns increases the weight of a sentence. See Table 3 for the complete list of verbs identified in protein-verb-protein patterns.

1.5 Database Schema

The database schema is partially reproduced in Figure S2 and illustrates the two main concepts in the system: genes on the one hand and scientific documents on the other. The database schema also covers information about organisms (synonyms and NCBI taxonomy identifiers), a simple English dictionary, and the complete MeSH thesaurus.

2. RESULTS

2.1 Recall and Precision of Gene Synonym Identification

Precision and recall of the gene detection module of iHOP are shown in Table 4 (as of March 2005). Problems and types of incorrectly identified genes (false positives) are listed in Table 5. 32% of all false positives can be attributed to gene synonyms that differ only in case and that were not used correctly by the authors (e.g. mouse ‘Mtx2’ and human ‘MTX2’).

F-measures of the Gene Synonym Identification Process

The F-measure, the harmonic mean of precision and recall, combines both into one comparable score (F = 2*recall*precision/(recall+precision)); it ranged between 70% and 91% depending on the organism (see Table 4). No other system covers all eight organisms or has been applied to the complete PubMed database, so only partial comparisons are possible. However, most systems for restricted domains publish lower or comparable F-measures: Fukuda et al.
(Fukuda et al., 1998) have suggested that even an extremely simple set of rules can yield a high F-measure (96%) when specialised on a certain subject (i.e. SH3-domain). A more generally applicable rule-based approach by Franzen et al. (Franzen et al., 2002) obtains an F-measure of about 67%. Morgan et al. (Morgan et al., 2003) and Collier et al. (Collier et al., 2000) report F=73% and F=75%, respectively, using Hidden Markov models. Tsuruoka and Tsujii use a dictionary approach and filter through a simple Bayesian classifier (F=70%) (Tsuruoka and Tsujii, 2003). Krauthammer et al. developed a dictionary approach and use the BLAST algorithm for searching; counting partial matches as positive, they report F=75% (Krauthammer et al., 2000). Mika and Rost use support vector machines (SVMs) to identify protein names in MEDLINE abstracts (F=76%) (Mika and Rost, 2004).

Assessment with BioCreative corpus

BioCreative (Critical Assessment for Information Extraction in Biology) is an open evaluation of systems on a number of biological text mining tasks (Yeh et al., 2004). Comparing text mining methods is generally difficult, especially when different text corpora or gold standards were used for evaluation. Efforts to evaluate and compare methods systematically are thus crucial for the development of the field. The BioCreative assessment comes closest to the real-world needs in biology, as it mimics the manual curation process behind model organism databases (Yeh et al., 2003; Yeh et al., 2004). An important contribution of these assessments is the manual annotation of biological text corpora for training and evaluation. In other words, articles have to be read by human experts to highlight all relevant biological entities within the text. Such corpora are extremely expensive to create, and only a few others are available (Hirschman et al., 2002; Kim et al., 2003).
In the BioCreative task 1B, systems were evaluated on their ability to identify the genes and gene products mentioned in the abstracts of yeast, Drosophila melanogaster and Mus musculus. In the context of this work, however, the comparison with BioCreative can serve only as an orientation, since in BioCreative task 1B gene identification was assessed for each organism independently and on organism-specific document corpora; the important real-world problem of synonym ambiguity was thus somewhat circumvented (Hirschman, 2004). In this work, the complete PubMed database was screened, and therefore no prior assumption could be made about the organisms of the genes in a given document.

Evaluation of false negatives

Recall would be 100% if there were no collisions between gene synonyms, synonyms from other organisms, other scientific terms or even common English words, and if scientists observed the orthographic guidelines (Proux et al., 1998). A general problem is the assignment of a gene synonym to the correct organism, particularly in cases where the same synonym is used in two organisms and differs only in its use of upper and lower case. In fact, most of the false negatives can be explained by synonyms that were correctly identified, but then assigned to the wrong organism. For example, nomenclature guidelines (i.e. HUGO) define human gene symbols to be preferentially in upper case, whereas synonyms of homologous mouse genes should be in lower case. However, authors do not always observe these guidelines, especially when the homology of two genes (e.g. mouse ‘Mtx2’ and human ‘MTX2’) forms part of their argument.

Examples of other cases of false negatives are shown in Table 6. The main sources of false negatives are synonyms that occur as substrings of complex expressions (e.g. ‘sup-19’ in ‘SUP-19(M210)V’) and synonyms that collide with common English words (e.g. ‘brown’, ‘yellow’, etc.).
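The substring problem can be made concrete with a small Python sketch. It is purely illustrative: the two matching functions, the whitespace tokenizer, and the example sentence are assumptions, not the matching procedure used in iHOP. It contrasts a whole-token lookup, which misses ‘sup-19’ inside ‘SUP-19(M210)V’, with a looser boundary-based match that finds it but does nothing against collisions with common English words.

```python
import re

def tokenize(text):
    # Assumption: tokens are whitespace-delimited and compared in lower case.
    return [token.lower() for token in text.split()]

def find_whole_token(synonym, text):
    """Strict lookup: the synonym must equal a complete token."""
    return synonym.lower() in tokenize(text)

def find_with_boundaries(synonym, text):
    """Looser lookup: the synonym may sit inside a larger expression,
    provided it is not directly flanked by a letter or a digit."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(synonym) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Invented example sentence containing the expression quoted in the text.
sentence = "The allele SUP-19(M210)V was mapped to chromosome V."

print(find_whole_token("sup-19", sentence))      # strict lookup misses it
print(find_with_boundaries("sup-19", sentence))  # boundary lookup finds it
```

Loosening the boundaries trades false negatives for false positives: the second function would also report the gene synonym ‘brown’ in an ordinary phrase such as "a brown precipitate", which is why word collisions need separate handling.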
Multi-term names will always be detected less frequently, since the number of possible orthographic variations increases drastically with the number of terms. At present, no automatic system solves the problem of synonyms colliding with synonyms from other organisms or with common English words. Considering that about half a million new articles appear in PubMed every year and that new genes are discovered and described continuously, gene name identification methods will probably always lag behind this creative process (Hoffmann and Valencia, 2003).

WEBSITE REFERENCES

http://www.ncbi.nlm.nih.gov/LocusLink/, LocusLink
http://flybase.bio.indiana.edu/, FlyBase
http://www.ihop-net.org/UniPub/iHOP, Information Hyperlinked over Proteins
http://www.gene.ucl.ac.uk/nomenclature/, HUGO Nomenclature Committee
http://www.ncbi.nlm.nih.gov/PubMed/, PubMed, National Library of Medicine
http://www.informatics.jax.org/searches/marker_form.shtml, MGI, Mouse Genome Informatics
http://zfin.org/ZFIN/, ZFIN, The Zebrafish Information Network
http://www.wormbase.org, WormBase
http://www.arabidopsis.org/, Tair, The Arabidopsis Information Resource
http://www.yeastgenome.org/, SGD, Saccharomyces Genome Database
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome, NCBI Genome
http://www.uniprot.org, UniProt, Universal Protein Resource

REFERENCES

Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res, 32 Database issue, D115-119.
Blaschke, C. and Valencia, A. (2001) The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform, 12, 123-134.
Collier, N., Nobata, C. and Tsujii, J. (2000) Extracting the names of genes and gene products with a Hidden Markov Model. Proc COLING 2000, 201-207.
Franzen, K., Eriksson, G., Olsson, F., Asker, L., Liden, P. and Coster, J. (2002) Protein names and how to find them. Int J Med Inf, 67, 49-61.
Fukuda, K., Tamura, A., Tsunoda, T. and Takagi, T. (1998) Toward information extraction: identifying protein names from biological papers. Pac Symp Biocomput, 707-718.
Hirschman, L. (2004) Personal Communication.
Hirschman, L., Morgan, A.A. and Yeh, A.S. (2002) Rutabaga by any other name: extracting biological names. J Biomed Inform, 35, 247-259.
Hoffmann, R. and Valencia, A. (2003) Life cycles of successful genes. Trends Genet, 19, 79-81.
Kim, J.D., Ohta, T., Tateisi, Y. and Tsujii, J. (2003) GENIA corpus-a semantically annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1, I180-I182.
Kim, W., Aronson, A.R. and Wilbur, W.J. (2001) Automatic MeSH term assignment and quality assessment. Proc AMIA Symp, 319-323.
Krauthammer, M., Rzhetsky, A., Morozov, P. and Friedman, C. (2000) Using BLAST for identifying gene and protein names in journal articles. Gene, 259, 245-252.
Mika, S. and Rost, B. (2004) Protein names precisely peeled off free text. Bioinformatics, 20 Suppl 1, I241-I247.
Morgan, A., Hirschman, L., Yeh, A. and Colosimo, M. (2003) Gene Name Extraction Using FlyBase Resources. ACL-03 Workshop on Natural Language Processing in Biomedicine, 1-8.
Pieprzyk, J. and Sadeghiyan, B. (1993) Design of hashing algorithms. Springer-Verlag, Berlin; New York.
Proux, D., Rechenmann, F., Julliard, L., Pillet, V.V. and Jacq, B. (1998) Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. Genome Inform Ser Workshop Genome Inform, 9, 72-80.
Pruitt, K.D., Katz, K.S., Sicotte, H. and Maglott, D.R. (2000) Introducing RefSeq and LocusLink: curated human genome resources at the NCBI. Trends Genet, 16, 44-47.
Tsuruoka, Y. and Tsujii, J. (2003) Boosting Precision and Recall of Dictionary-Based Protein Name Recognition.
ACL-03 Workshop on Natural Language Processing in Biomedicine, 41-48.
White, J.A., McAlpine, P.J., Antonarakis, S., Cann, H., Eppig, J.T., Frazer, K., Frezal, J., Lancet, D., Nahmias, J., Pearson, P. et al. (1997) Guidelines for human gene nomenclature (1997). HUGO Nomenclature Committee. Genomics, 45, 468-471.
Yeh, A., Blaschke, C., Apweiler, R., Wu, C., Blake, J., Donaldson, I., Hunter, L., Friedman, C., Valencia, A. and Hirschman, L. (2004) BioCreAtIvE, Critical Assessment of Information Extraction systems in biology. http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html.
Yeh, A.S., Hirschman, L. and Morgan, A.A. (2003) Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup. Bioinformatics, 19 Suppl 1, i331-339.

Table 3. Most frequently identified verbs in gene-verb-gene patterns

Class of interaction | Verb | Rel. frequency (%)
enzymatic | phosphorylate | 2.11
enzymatic | cleave | 0.43
enzymatic | delete | 0.25
enzymatic | dephosphorylate | 0.13
enzymatic | oxidize | 0.07
enzymatic | ubiquitinate | 0.05
enzymatic | cut | 0.04
enzymatic | catalyse | 0.04
enzymatic | hyper-phosphorylate | 0.03
enzymatic | hydrolyze | 0.03
enzymatic | split | 0.03
enzymatic | un-methylate | 0.01
enzymatic | trans-phosphorylate | 0.01
enzymatic | nick | 0.00
permanent | link | 1.74
permanent | fuse | 0.61
physical | bind | 14.48
physical | interact | 4.82
physical | (form) complex | 4.31
physical | couple | 0.56
physical | transport | 0.50
physical | stabilize | 0.46
physical | co-immunoprecipitate | 0.36
physical | footprint | 0.27
physical | dock | 0.15
physical | co-purify | 0.07
physical | attach | 0.07
physical | destabilize | 0.06
physical | connect | 0.05
physical | detach | 0.00
physical | co-immunopurify | 0.00
positional | co-localize | 0.87
positional | co-migrate | 0.02
regulatory | induce | 18.92
regulatory | inhibit | 8.21
regulatory | activate | 7.61
regulatory | stimulate | 7.31
regulatory | regulate | 5.92
regulatory | enhance | 4.70
regulatory | suppress | 2.90
regulatory | control | 2.66
regulatory | block | 2.55
regulatory | promote | 1.85
regulatory | down-regulate | 1.63
regulatory | repress | 1.27
regulatory | influence | 0.74
regulatory | co-express | 0.62
regulatory | deactivate | 0.31
regulatory | cross-react | 0.09
regulatory | over-produce | 0.04
regulatory | co-regulate | 0.02
regulatory | super-induce | 0.01

Table 4. Assessment of recall and precision

Measure | H.sapiens | M.musculus | D.melanogaster | D.rerio | C.elegans | A.thaliana | S.cerevisiae | E.coli | Average
Gene quoting docs (RD) (Goldstd. LocusLink) | 54460 | 23669 | 13147 | 586 | 2202 | - | 25489 | - | 19926
RD: Recall (%) | 91.53 | 81.28 | 85.56 | 81.06 | 94.6 | - | 88.59 | - | 87.1
Nr of gene-article references (GA) (Goldstd. LocusLink) | 68313 | 30415 | 28604 | 873 | 4199 | - | 59652 | - | 32009
GA: Recall | 55916 | 21256 | 18596 | 620 | 3810 | - | 49945 | - | 25023
GA: Recall (%) | 81.85 | 69.89 | 65.01 | 71.02 | 90.74 | - | 83.73 | - | 77
Exact gene localisations (GL) (Goldstd. manual) | 403 | 381 | 438 | 360 | 597 | 295 | 536 | 476 | 436
GL: True positives (Goldstd. manual) | 351 | 354 | 409 | 352 | 584 | 271 | 533 | 442 | 412
GL: Precision, exact localisation (%) | 87.1 | 92.91 | 93.38 | 97.78 | 97.82 | 91.86 | 99.44 | 92.86 | 94.14
F-measure (%) | 84.39 | 79.77 | 76.66 | 82.28 | 94.15 | - | 90.91 | - | 84.69
Recall of gene-article references (Goldstd. BioCreative) | - | 66.02 | 62.78 | - | - | - | 85.71 | - | 71.5

Table 5. Error sources for incorrectly identified genes

Type of error | H.sapiens | M.musculus | D.melanogaster | D.rerio | C.elegans | A.thaliana | S.cerevisiae | E.coli | Total
Other gene from same organism | 7 | 3 | 2 | 1 | 0 | 7 | 2 | 3 | 25
Gene family, not individual gene | 4 | 1 | 4 | 1 | 0 | 8 | 1 | 4 | 23
Gene from other organism (not in gene dictionary) | 6 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 9
Other biomedical concept | 7 | 3 | 8 | 1 | 1 | 2 | 0 | 3 | 25
Symbol, other | 4 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 12

Gene from related organism (e.g. human/mouse, E.coli/B.subtilis); Disease; Gene from other organism (in gene dictionary) (cell alignment not recoverable): 10 14 2 2 2 6 24 60 12 2 2 0 0 8 0 2 0 10 0 0 0 0 0 0 14 22

Table 6.
Reasons for false negatives

Type of problem | Synonym | Potential finding site | Organism
English word | fused toes | fused toes | Mus musculus
English word | male lethal | male lethal | Drosophila
English word | Brown | brown | Drosophila
English word | Capricious | capricious | Drosophila
English word | Map | map | Drosophila
English word | Yellow | yellow | Drosophila
English word | Stoned | stoned | Drosophila
English word | Fringe | fringe | Drosophila
English word | Roulette | roulette | Drosophila
English word | crippled leg | crippled leg | Drosophila
English word | Mastermind | mastermind | Drosophila
English word | Chip | chip | Drosophila
English word | big brain | big brain | Drosophila
Too short | T1 | T1 | Human
Too short | A1 | A1 | Human
Substring | CRA | TCRAV2 | Human
Substring | AMHR | AMHRII | Human
Substring | SURF 6, IGF 1 | SURF-1 6 IGF-BINDING | Human
Substring | IL-1B | IL-1B+3953 | Human
Substring | let 33 | LETHAL 33 | C. elegans
Substring | unc-82 | UNC-82(E1220)IV | C. elegans
Substring | rad-2 | RAD-2—ARE | C. elegans
Substring | sup-19 | SUP-19(M210)V | C. elegans
Part of complex | TIS21 | TIS21/PC3/BTG1/TOB | Human
Part of complex | WAF1/CIP1/SDI1 | WAF1/CIP1/SDI1 | Human
Initials | PKCA | PKCALPHA | Human
Initials | TNF-RS RICH | TNF-RS RICH | Human
Initials | P85B | P85BETA | Human
Initials | Rora | RORALPHA | Mus musculus
Initials | Spi-9 | SYSTEM PI-9 | Mus musculus

Table 7.
List of MeSH and Gene clusters and most significant terms

Cluster ID | M_G | Name | Nr of abstracts | Term category | Term | Z-Score
6368 | GENE | SNRPN | 172 | Biological Sciences | Genomic Imprinting | -791.94
153220 | GENE | AT5g62530 | 183 | Biological Sciences | Gene Expression Regulation | -484.08
181 | GENE | AGXT | 129 | Diseases | Hyperoxaluria | -477.15
157592 | GENE | oxyR | 162 | Biological Sciences | Oxidative Stress | -434.4
142368 | GENE | NPH1 | 21 | Biological Sciences | Phototropism | -429.53
… | … | | | | |
21457 | MeSH | Thoracic Cavity | 14 | Anatomy | Abdominal Cavity | -27.83
21418 | MeSH | Thoracic Wall | 65 | Anatomy | Abdominal Wall | -32.69
12813 | MeSH | Giant Cells, Foreign-Body | 132 | Diagnostic, Therapy | Absorbable Implants | -27.1
18631 | MeSH | Diagnostic Techniques, Urological | 72 | Diagnostic, Therapy | Absorbable Implants | -17.53
17534 | MeSH | Tin Polyphosphates | 190 | Chemicals and Drugs | Acidulated Phosphate Fluoride | -9.46
… | … | | | | |

List truncated. The complete list (18703 lines) is available at http://www.pdg.cnb.uam.es/supplement/gene_navigation/index.html.
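The Z-scores listed above are log binomial probabilities as defined in Section 1.2. The following Python sketch shows one way such a value could be computed directly in log space, with exact log-factorials up to s = 100 and Stirling's approximation above that threshold. The function names are hypothetical and the code is an illustration of the formulas only; the original implementation is not described in the text.

```python
import math

S = 100  # threshold s from Section 1.2: Stirling's approximation for n > s

def log_factorial(n):
    """Return ln(n!): exact sum for n <= S, Stirling's approximation otherwise."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n <= S:
        return sum(math.log(i) for i in range(2, n + 1))
    # ln(n^n * e^-n * sqrt(2*pi*n))
    return n * math.log(n) - n + 0.5 * math.log(2.0 * math.pi * n)

def log_binomial_probability(k, n, p):
    """Return Z = ln P(k | n, p) for a binomial distribution with 0 < p < 1."""
    if k < 0 or k > n:
        return float("-inf")  # the binomial coefficient is 0, so P = 0
    log_coefficient = log_factorial(n) - log_factorial(k) - log_factorial(n - k)
    return log_coefficient + k * math.log(p) + (n - k) * math.log(1.0 - p)
```

Working with logarithms avoids the underflow that the direct product would cause: for a term seen far more often in a cluster than its background frequency predicts, the probability itself is too small for a floating point number, whereas its logarithm is an ordinary negative value, comparable to the Z-scores in Table 7.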