
BIOINFORMATICS
Vol. 21 Suppl. 2 2005, Supplementary Material
doi:10.1093/bioinformatics/bti1142
Text Mining
Supplementary Material: Implementing the iHOP Concept for
Navigation of Biomedical Literature
Robert Hoffmann*,† and Alfonso Valencia
National Center of Biotechnology, CNB-CSIC. Campus de la UAM. Madrid E-28049, Spain.
Received on March 15, 2005; revised on May 28, 2005.
This supplementary material provides details on issues that
are not fully discussed in the main article. It does not introduce
concepts beyond those presented in the main article.
1. METHODS
1.1 Gene Synonym Identification in Biomedical Text
Gene synonym dictionary
Data for the initial gene synonym dictionary were collected from
publicly available databases, e.g. LocusLink (Pruitt et al., 2000),
FlyBase, or UniProt (Apweiler et al., 2004). In addition to primary
gene symbols and names, this initial dictionary also contained,
where known, alternative synonyms and orthographic variants. Although genome databases were the principal sources of primary
symbols and names, manually curated synonym resources, e.g.
the HUGO Nomenclature Committee (White et al., 1997), are also
of great importance, since genes that were discovered and described by different laboratories often have several synonyms.
Synonyms from this initial dictionary were then altered in an
iterative process to account for orthographic variation. The heuristics for these alterations were derived manually and comprised
general rules (e.g. expanding ‘LAIR1’ to ‘LAIR-1’) as well as organism-specific rules (e.g. converting ‘AP*’ to
‘APETALA*’ in Arabidopsis thaliana). As a result of this iterative
extension, the total number of synonyms expanded from
half a million to 3.2 million, and the average number of synonyms
per gene increased from 3 to 19. Pre-generating the complete list
of synonyms makes it possible to assess the quality (or uniqueness)
of each synonym, according to the number of distinguishing characteristics, before using it in the identification process. Depending
on the quality of a synonym, either its occurrence alone was sufficient to associate the corresponding gene with a text, or additional
evidence was necessary (e.g. a second synonym or parts of the
name). Moreover, each synonym carries attributes that specify how precisely it must match and where in a sentence it may occur (e.g. case-sensitive, not at the beginning of a sentence).
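The iterative expansion described above can be sketched as follows. This is only an illustration under stated assumptions: the two rules are the examples given in the text, while the function names, the `ORGANISM_RULES` table and the regular expressions are hypothetical and do not reproduce the actual iHOP heuristics.

```python
import re

# Hypothetical organism-specific rewrite rules: (pattern, replacement).
# Only the 'AP*' -> 'APETALA*' example comes from the text.
ORGANISM_RULES = {
    "Arabidopsis thaliana": [(r"^AP(\d+)$", r"APETALA\1")],
}

def expand_variants(synonym, organism=None):
    """Return the synonym together with its directly derived variants."""
    variants = {synonym}
    # General rule from the text: 'LAIR1' -> 'LAIR-1'.
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", synonym)
    if m:
        variants.add(f"{m.group(1)}-{m.group(2)}")
    # Organism-specific rules, applied only for the matching organism.
    for pattern, replacement in ORGANISM_RULES.get(organism, []):
        if re.match(pattern, synonym):
            variants.add(re.sub(pattern, replacement, synonym))
    return variants

def expand_iteratively(seeds, organism=None, max_rounds=5):
    """Re-apply the rules to newly created variants until nothing new appears."""
    known = set(seeds)
    for _ in range(max_rounds):
        new = set()
        for s in known:
            new |= expand_variants(s, organism)
        new -= known
        if not new:
            break
        known |= new
    return known
```

Because derived variants are fed back into the rules, a single seed such as ‘AP1’ can yield several synonyms, which is consistent with the growth in dictionary size reported above.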
Evaluation of gene synonym qualities
Before synonyms are used in the identification process, their quality (or uniqueness) is assessed according to the number of distinguishing characteristics, and each synonym is classified as good, weak, very weak or bad (Table 1).
*To whom correspondence should be addressed.
†Present address: Memorial Sloan-Kettering Cancer Center, MSKCC,
New York, NY 10021, USA.
Table 1. Features and criteria for defining synonym qualities
Synonym quality criteria
1. Length in characters
2. Length in terms
3. Single or multiple term composition
4. Character composition (digits, upper and lower cases, special characters)
5. Collision with stop words (e.g. ‘the’, ‘and’, ‘or’)
6. Collision with common English words
7. Collision with other biological entities (e.g. chemical abbreviations)
8. Collision with other gene synonyms (ambiguities within and between organisms)
9. Original synonym or derivative synonym
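Several of the features in Table 1 can be computed directly from the synonym string. The sketch below is a hedged illustration: the word lists are tiny placeholders, the feature names are hypothetical, and the thresholds that map features to the classes good, weak, very weak and bad are not given in the text and are therefore not modelled.

```python
# Placeholder word lists; a real system would use full dictionaries.
STOP_WORDS = {"the", "and", "or"}
COMMON_ENGLISH = {"brown", "yellow", "map"}

def synonym_features(synonym, is_derivative=False):
    """Compute simple quality features for one synonym (criteria 1-6 and 9)."""
    terms = synonym.split()
    return {
        "length_chars": len(synonym),                      # criterion 1
        "length_terms": len(terms),                        # criterion 2
        "multi_term": len(terms) > 1,                      # criterion 3
        "has_digit": any(c.isdigit() for c in synonym),    # criterion 4
        "has_upper": any(c.isupper() for c in synonym),
        "has_lower": any(c.islower() for c in synonym),
        "has_special": any(not c.isalnum() and not c.isspace() for c in synonym),
        "is_stop_word": synonym.lower() in STOP_WORDS,     # criterion 5
        "is_common_english": synonym.lower() in COMMON_ENGLISH,  # criterion 6
        "is_derivative": is_derivative,                    # criterion 9
    }
```

Criteria 7 and 8 (collisions with other biological entities and with other gene synonyms) require the complete dictionary and are left out of this sketch.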
Creating the Gene Article Index
Managing about 3.2 million synonyms (together with their
quality attributes, search criteria and organisms of origin) is
extremely expensive in terms of RAM. Therefore, an
initial raw association of genes to abstracts was carried out without
taking these details into consideration. All synonyms were
split into single-word baits, which were then searched for in a case-insensitive manner using hashcode comparisons (Pieprzyk and Sadeghiyan, 1993) instead of character-by-character comparisons or regular expressions.
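A minimal sketch of such a raw association step is shown below. It assumes that a hash table keyed on lower-cased single words is an acceptable stand-in for the hashcode comparison mentioned above; the function names and the tokenisation are hypothetical.

```python
from collections import defaultdict

def build_bait_index(synonyms_by_gene):
    """Map each lower-cased single-word bait to the genes whose synonyms contain it."""
    index = defaultdict(set)
    for gene, synonyms in synonyms_by_gene.items():
        for synonym in synonyms:
            for word in synonym.lower().split():
                index[word].add(gene)
    return index

def raw_candidates(abstract, index):
    """Return all genes with at least one bait in the abstract (unfiltered)."""
    hits = set()
    for token in abstract.lower().split():
        token = token.strip(".,;:()[]'\"")   # crude punctuation handling
        # Dictionary lookup uses the token's hash instead of scanning every synonym.
        hits |= index.get(token, set())
    return hits
```

The result is deliberately over-inclusive; the context-dependent filtering is left to the subsequent step.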
In a subsequent step, genes were permanently assigned to whole
abstracts, taking all contextual information within abstracts into
consideration (e.g. the organisms mentioned in or assigned to the
text, predefined negative contexts, etc.). Considering this contextual information makes it possible to account for cases in which,
for example, no high-quality synonym is found, but two weak
synonyms are present. Although complete abstracts were considered, sentence boundaries and sentence structures (e.g. brackets)
were taken into account; a synonym, for example, must not span
two sentences. Furthermore, acronyms found in the text corpus
were resolved to detect conflicts between gene synonyms and other
biological or medical entities (e.g. diseases, methods, etc.).
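The way contextual evidence is combined could look roughly like the following. This is a sketch under assumptions: the text only states that one high-quality synonym can suffice, that two weak synonyms can replace it, and that organisms and negative contexts are considered; the exact decision rule, the function name and the parameters are hypothetical.

```python
def accept_gene(hit_qualities, organism_in_context=True, negative_context=False):
    """Decide whether a gene is assigned to an abstract.

    hit_qualities: quality labels ('good', 'weak', ...) of the synonyms of
    one gene that were found in the abstract.
    """
    if negative_context or not organism_in_context:
        return False
    if "good" in hit_qualities:
        return True                      # one high-quality synonym suffices
    # Otherwise require at least two weak synonyms as supporting evidence.
    return sum(q == "weak" for q in hit_qualities) >= 2
```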
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
After all genes had been assigned to abstracts, they were assigned
to precise positions in the text. This is especially important for
overlapping synonyms, in other words, synonyms that start with
the same term but are of different lengths (e.g. ‘erythropoietin’ and
‘erythropoietin receptor’), or synonyms that differ only in their
use of upper and lower case.
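One common way to resolve overlapping synonyms is greedy longest-match, sketched below. The text does not state which strategy iHOP uses, so this is an assumption; the function name is hypothetical, matching is case-sensitive, and synonyms are assumed to be non-empty strings.

```python
def locate_synonyms(tokens, synonyms):
    """Return (start, end, synonym) spans over a token list, longest match first."""
    by_first = {}
    for syn in synonyms:
        words = tuple(syn.split())
        by_first.setdefault(words[0], []).append(words)
    for candidates in by_first.values():
        candidates.sort(key=len, reverse=True)   # try longer synonyms first

    spans, i = [], 0
    while i < len(tokens):
        for words in by_first.get(tokens[i], []):
            if tuple(tokens[i:i + len(words)]) == words:
                spans.append((i, i + len(words), " ".join(words)))
                i += len(words)
                break
        else:
            i += 1                               # no synonym starts here
    return spans
```

With this strategy, ‘erythropoietin receptor’ is reported as one span and is not additionally counted as a mention of ‘erythropoietin’.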
1.2 Assessment of MeSH and Gene Cluster Contents
To make large text resources navigable it is essential to organize
the literature into clusters of similar sizes and similar information
content. To gain an idea of the specificity and content of gene
clusters, we compared them to clusters based on MeSH terms.
MeSH is the thesaurus of the National Library of Medicine (Kim
et al., 2001) and was designed as a fast-access classification for
PubMed. Here, a gene cluster is defined as all abstracts that mention a
specific gene; a MeSH cluster is defined as all abstracts that have a
certain MeSH term associated.
We compared the frequencies of MeSH terms in the documents of
a specific cluster with their frequencies in the background dictionary to estimate a cluster’s content. The background dictionary was
constructed from about 4000 different MeSH terms (covering
anatomy, diseases, physical and biological science, chemicals, and
drugs) previously associated to all PubMed abstracts. The probability (PT) of finding a term (T) the observed number of times (k)
in a document cluster (C) was then calculated for all clusters from
the binomial distribution, given the known background frequency
(p) and the total number of terms within a cluster (n).
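\[
P_T(k \mid n, p) = \binom{n}{k}\, p^{k}\, q^{\,n-k}
\]
where $n \in \mathbb{N}$ is the number of terms assigned to a document cluster (C),
$k = 0, 1, \ldots, n$ is the number of occurrences of term (T) within the cluster (C),
$p = P_T(X = 1)$ is the relative frequency of term (T) in the background dictionary, and
$q = P_T(X = 0) = 1 - p$.
\[
\binom{n}{k} :=
\begin{cases}
\dfrac{n!}{k!\,(n-k)!} & \text{when } 0 \le k \le n \\
0 & \text{when } 0 \le n < k
\end{cases}
\qquad
n! :=
\begin{cases}
1 \cdot 2 \cdot 3 \cdots n & \text{when } n \le s \\
n^{n} e^{-n} \sqrt{2 \pi n} & \text{when } n > s
\end{cases}
\]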
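[Figure S1: bar chart. Vertical axis: term categories (Diagnosis, Therapy; Anatomy; Chemicals and Drugs; Diseases; Biological Sciences). Horizontal axis: relative number of clusters (%), 0 to 60. Bars clustered by Genes and by MeSH Terms.]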
Fig. S1. Comparison of gene and MeSH cluster contents. The most frequent
term categories are listed on the vertical axis. In general, gene and MeSH
clusters cover most domains to a comparable extent, except for the prevalence of diseases in MeSH clusters and of molecular aspects in gene clusters. Genes and proteins are not distinguished in the iHOP system because
it is nearly impossible to detect whether an author refers to a gene or a gene
product; clear nomenclature guidelines to separate both concepts are missing or not used. However, we do not expect a separation of genes and proteins to enhance the navigability of the final network.
1.3 Sentence Ranking
The basic concept of navigation was enhanced by the weighting of
sentences according to simple features and statistical parameters,
such that the probability of finding relevant information first would
be increased (Table 2).
Table 2. Sentence ranking criteria
Here n and k are natural numbers; the binomial coefficient is taken as 0 when k > n, and n! was estimated using Stirling’s approximation for large n (n > s, with s := 100).
For a reference on Stirling’s approximation see (Knuth, 1997). To
avoid floating point errors, the natural logarithm of the probability
was calculated (ZT = ln PT). Consequently, for a given cluster, the
smaller the value of ZT, the more specific the term. To keep the
general comparison between MeSH and gene clusters straightforward, we considered for each cluster only the most significant term
and its corresponding category, and only clusters of a user-manageable size of fewer than 200 articles.
Table 7 lists the most significant terms in gene clusters. We found
that gene and MeSH clusters cover most domains to a comparable
extent (Figure S1), with the expected exception that MeSH clusters
show an emphasis on diseases, whilst gene clusters are focused
more strongly on molecular aspects, e.g. genomic imprinting, regulation, oxidative stress.
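The calculation of ZT = ln PT can be sketched in a few lines. The code follows the definitions in the text (binomial distribution, exact factorial up to s = 100, Stirling’s approximation above); the function names are hypothetical, and working with sums of logarithms is one assumed way of avoiding the floating point problems mentioned above.

```python
import math

S = 100  # threshold s: above it, n! is estimated with Stirling's approximation

def ln_factorial(n):
    """Natural logarithm of n!."""
    if n <= S:
        # ln(1 * 2 * ... * n) as a sum of logarithms
        return sum(math.log(i) for i in range(2, n + 1))
    # Stirling: n! ~ n^n * e^(-n) * sqrt(2 * pi * n)
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)

def ln_binomial_probability(k, n, p):
    """Z_T = ln P_T(k | n, p), assuming 0 < p < 1."""
    if not 0 <= k <= n:
        return float("-inf")   # binomial coefficient is 0 outside this range
    ln_coeff = ln_factorial(n) - ln_factorial(k) - ln_factorial(n - k)
    return ln_coeff + k * math.log(p) + (n - k) * math.log(1 - p)
```

For example, `ln_binomial_probability(2, 4, 0.5)` equals ln(6/16), the logarithm of the probability of observing a term twice among four terms at a background frequency of 0.5.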
Sentence ranking criteria
1. Existence/number of associative verbs (in a gene-verb-gene pattern) per sentence (e.g. ‘bind’, ‘phosphorylate’, etc.)
2. Number of genes per sentence (to avoid non-specific relationships and lists of genes)
3. Length of sentence (the shorter the sentence, the more precise its information)
4. Existence of experimental evidence for certain gene-gene associations
5. Statistical significance of the genes in a sentence; avg. Z-Score (ln PT)
\[
P_T(k \mid n, p) = \binom{n}{k}\, p^{k}\, q^{\,n-k}
\]
where $n \in \mathbb{N}$ is the number of other genes occurring in documents about the
gene in question (G), $k = 0, 1, \ldots, n$ is the number of occurrences of gene X
within documents about gene G, and $p = P_T(X = 1)$ is the relative frequency
of gene X in PubMed (background).
[Figure S2: schema diagram. Tables shown: lnk_pm__pm_mh, pm_mh_tree, lnk_unipub__pm_mh, pm_mh, lnk_pm__pm_mh__pm_sh, freq_pm_mh_year, raw_lnk_unipub__entity, freq_pm_mh, pm_sh, sentence, lnk_pm_mh__synonym, country, item_association, unipub, language, location_chr, lnk_unipub__translocation, aberration_chr, lnk_unipub__organism, pm_author, pubmed, journal, lnk_pm__pm_author_subject, lnk_pm__pm_author, lnk_unipub__gene, pm_type, lnk_mm_location_chr__gene, synonym, lnk_gene__synonym, lnk_organism__synonym, organism, entity_synonym, freq_gene, lnk_synonym__island, gene, homologue_gene, stopword, lnk_gene__genbank, island, word_name, lnk_pm__foreign_key, lnk_gene__foreign_key, foreign_key, freq_island, lnk_island__island_iso, island_iso, experimental_data, exp_gene_gene_data, pm_rn, lnk_pm__pm_rn, freq_island_iso.]
Fig. S2. Relational Database Schema. Each box in this schema represents a table in the relational database. Two central tables
stand out: the publication table (unipub) and the gene table (gene), which are connected through the gene-article index in the
lnk_unipub__gene table. Additional information is arranged around these two central concepts: MeSH terms (pm_mh), organisms (organism), chromosome aberrations (aberration_chr), a simple English dictionary (word_name) and references to external databases (foreign_key).
1.4 Associative Verbs
It is known that about 90% of all active relations between proteins
in the literature are expressed syntactically as “protein verb protein” (Blaschke and Valencia, 2001). In the current implementation
of iHOP, verbs that describe interactions between proteins (e.g.
‘bind’, ‘phosphorylate’, ‘inhibit’, ‘activate’, etc.) and occur between two proteins can be highlighted to facilitate the perception
of relevant information. Sentence boundaries and brackets are
taken into account in the pattern scanning; both proteins must occur in the same sentence or in the same bracket. This simple syntax
covers most active relations; however, other syntaxes could be
included in future developments. Furthermore, the occurrence of
these patterns positively influences the weighting of sentences. See
Table 3 for the complete list of verbs identified in protein-verb-protein patterns.
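The pattern scan can be illustrated with a small sketch. It is not the iHOP implementation: the verb list contains only the examples from the text, the suffix stripping is a crude assumption, gene names are assumed to be single tokens, and nested brackets are not handled.

```python
import re

VERBS = {"bind", "phosphorylate", "inhibit", "activate"}  # examples from the text

def stem(token):
    """Very crude normalisation of inflected verb forms (illustrative only)."""
    token = token.lower()
    for suffix in ("ed", "es", "s", "d"):
        base = token[: -len(suffix)]
        if token.endswith(suffix) and base in VERBS:
            return base
    return token

def find_patterns(sentence, genes):
    """Return (gene, verb, gene) triples found within one sentence or one bracket."""
    # Bracketed text is scanned separately so that no pattern straddles a bracket.
    brackets = re.findall(r"\(([^()]*)\)", sentence)
    outside = re.sub(r"\([^()]*\)", " ", sentence)
    triples = []
    for segment in [outside] + brackets:
        last_gene, verb = None, None
        for tok in re.findall(r"[\w-]+", segment):
            if tok in genes:
                if last_gene and verb:
                    triples.append((last_gene, verb, tok))
                last_gene, verb = tok, None
            elif last_gene:
                candidate = stem(tok)
                if candidate in VERBS:
                    verb = candidate
    return triples
```

Applied to the made-up sentence "SNF1 phosphorylates MIG1 (REG1 binds GLC7).", the function reports one triple for the main clause and one for the bracket.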
1.5 Database Schema
The database schema is partially reproduced in Figure S2 and illustrates the two main concepts in the system: genes on the one hand
and scientific documents on the other. The database schema also
covers information about organisms (synonyms and NCBI taxonomy identifiers), a simple English dictionary, and the complete
MeSH thesaurus.
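The relationship between the two central tables can be illustrated with an in-memory SQLite database. The table names unipub, gene and lnk_unipub__gene are taken from Figure S2; all column names, the example rows and the helper function are hypothetical, since the figure does not list columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE unipub (unipub_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE gene   (gene_id   INTEGER PRIMARY KEY, symbol TEXT);
-- gene-article index linking the two central tables
CREATE TABLE lnk_unipub__gene (
    unipub_id INTEGER REFERENCES unipub(unipub_id),
    gene_id   INTEGER REFERENCES gene(gene_id),
    PRIMARY KEY (unipub_id, gene_id)
);
""")
conn.execute("INSERT INTO unipub VALUES (1, 'Example abstract')")
conn.execute("INSERT INTO gene VALUES (1, 'MTX2')")
conn.execute("INSERT INTO lnk_unipub__gene VALUES (1, 1)")

def abstracts_for_gene(symbol):
    """Return the titles of all publications linked to a gene symbol."""
    rows = conn.execute(
        """SELECT u.title FROM unipub u
           JOIN lnk_unipub__gene l ON l.unipub_id = u.unipub_id
           JOIN gene g ON g.gene_id = l.gene_id
           WHERE g.symbol = ?""",
        (symbol,),
    ).fetchall()
    return [title for (title,) in rows]
```

A link table of this kind lets one publication refer to many genes and one gene to many publications.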
2. RESULTS
2.1 Recall and Precision of Gene Synonym Identification
Precision and recall of the gene detection module of iHOP are
shown in Table 4 (as of March 2005). The types of incorrectly identified genes (false positives) are listed in Table 5. 32%
of all false positives can be attributed to gene synonyms that
differ only in case and were not used correctly by the
authors (e.g. mouse ‘Mtx2’ and human ‘MTX2’).
F-measures of the Gene Synonym Identification Process
The F-measure (the harmonic mean of precision and recall, which
combines both into one comparable score;
F=2*recall*precision/(recall+precision)) ranged between 70% and
91% depending on the organism (see Table 4). No other
system covers all eight organisms or has been applied to the
complete PubMed database; therefore only partial comparisons are
possible.
However, most systems for restricted domains report lower or
comparable F-measures. Fukuda et al. (1998) have
suggested that even an extremely simple set of rules can yield a
high F-measure (96%) when specialised on a certain subject (i.e.
SH3-domain). A more generally applicable rule-based approach by
Franzen et al. (2002) obtains an F-measure of about
67%. Morgan et al. (2003) and Collier et al. (2000)
report F=73% and F=75%, respectively, using
Hidden Markov models. Tsuruoka and Tsujii (2003) use a dictionary approach
and filter through a simple Bayesian classifier (F=70%).
Krauthammer et al. (2000) developed a dictionary approach that
uses the BLAST algorithm for searching; counting partial
matches as positive, they report F=75%.
Mika and Rost (2004) use support vector machines (SVMs) to identify protein
names in MEDLINE abstracts (F=76%).
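The F-measure formula given above is easy to check against Table 4. In the sketch below the function name is hypothetical; the input values are the recall of gene-article references and the precision of exact localisation reported in Table 4 for H. sapiens (81.85 and 87.1), which reproduce the tabulated F-measure of 84.39.

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both in the same unit, e.g. %)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# H. sapiens values from Table 4
print(round(f_measure(81.85, 87.1), 2))  # 84.39
```

The same calculation with the M. musculus values (69.89 and 92.91) gives 79.77, again in agreement with Table 4.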
Assessment with BioCreative corpus
BioCreative (Critical Assessment for Information Extraction in
Biology) is an open evaluation of systems on a number of biological text mining tasks (Yeh et al., 2004). Comparing text
mining methods is generally difficult, especially when different
text corpora or gold standards were used for evaluation. Efforts to
evaluate and compare methods systematically are thus crucial for
the development of the field. The BioCreative assessment comes
closest to real-world needs in biology, as it mimics the manual
curation process behind model organism databases (Yeh et al.,
2003; Yeh et al., 2004).
An important contribution of these assessments is the manual annotation of biological text corpora for training and evaluation. In
other words, articles have to be read by human experts to highlight
all relevant biological entities within the text. Such corpora are
extremely expensive to create, and only a few others are
available (Hirschman et al., 2002; Kim et al., 2003). In BioCreative task 1B, systems were evaluated on their ability to identify the genes and gene products mentioned in abstracts on
yeast, Drosophila melanogaster and Mus musculus.
In the context of this work, however, the comparison with BioCreative is only indicative, since in BioCreative task 1B gene
identification was assessed for each organism independently and
on organism-specific document corpora; the important real-world problem of synonym ambiguity was thus somewhat circumvented (Hirschman, 2004).
In this work, the complete PubMed database was screened, and no prior assumption could therefore be made about the organisms of the genes in a
given document.
Evaluation of false negatives
Recall would be 100% if there were no collisions between gene
synonyms and synonyms from other organisms, other scientific terms
or even common English words, and if scientists observed the
orthographical guidelines (Proux et al., 1998). A general problem
is the assignment of a gene synonym to the correct organism, particularly in cases where the same synonym is used in two organisms and differs only in its use of upper and lower case.
In fact, most of the false negatives can be explained by synonyms
that were correctly identified, but then assigned to the wrong organism. For example, nomenclature guidelines (i.e. HUGO) state that
human gene symbols should preferentially be in upper case, whereas
synonyms of homologous mouse genes should be in lower case.
However, authors do not always observe these guidelines, especially when the homology of two genes (e.g. mouse ‘Mtx2’ and
human ‘MTX2’) forms part of their argument. Examples of other
cases of false negatives are shown in Table 6. Further major sources of
false negatives are synonyms that occur as substrings of complex
expressions (e.g. ‘sup-19’ in ‘SUP-19(M210)V’) and synonyms
that collide with common English words (e.g. ‘brown’, ‘yellow’,
etc.). Multiple-term names will always be detected less frequently,
since the number of possible orthographic variations increases
drastically with the number of terms.
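Synonyms that differ only in case between organisms can at least be flagged in advance. The following sketch groups synonyms by their lower-cased form; the function name and the input format are hypothetical, and the example reuses the ‘Mtx2’/‘MTX2’ pair from the text.

```python
from collections import defaultdict

def case_only_collisions(synonyms):
    """Group (organism, synonym) pairs whose synonyms differ only in case.

    Returns only groups that involve more than one organism.
    """
    groups = defaultdict(set)
    for organism, synonym in synonyms:
        groups[synonym.lower()].add((organism, synonym))
    return {
        key: members
        for key, members in groups.items()
        if len({org for org, _ in members}) > 1
    }
```

Such a list does not resolve the ambiguity, but it identifies the synonyms for which the organism context of the abstract has to decide.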
At present, no automatic system solves the
problem of synonyms colliding with synonyms from other organisms or with common English words. Considering that about half a million new articles appear in PubMed every year and that new genes
are discovered and described continuously, gene name identification methods will probably always lag behind this creative process
(Hoffmann and Valencia, 2003).
WEBSITE REFERENCES
http://www.ncbi.nlm.nih.gov/LocusLink/, LocusLink
http://flybase.bio.indiana.edu/, FlyBase
http://www.ihop-net.org/UniPub/iHOP, Information Hyperlinked over Proteins
http://www.gene.ucl.ac.uk/nomenclature/, HUGO Nomenclature Committee
http://www.ncbi.nlm.nih.gov/PubMed/, PubMed, National Library of Medicine
http://www.informatics.jax.org/searches/marker_form.shtml, MGI, Mouse Genome Informatics
http://zfin.org/ZFIN/, ZFIN, The Zebrafish Information Network
http://www.wormbase.org, WormBase
http://www.arabidopsis.org/, Tair, The Arabidopsis Information Resource
http://www.yeastgenome.org/, SGD, Saccharomyces Genome Database
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome, NCBI Genome
http://www.uniprot.org, UniProt, Universal Protein Resource
REFERENCES
Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M. et al. (2004) UniProt: the Universal
Protein knowledgebase. Nucleic Acids Res, 32 Database issue, D115-119.
Blaschke, C. and Valencia, A. (2001) The potential use of SUISEKI as a protein
interaction discovery tool. Genome Inform Ser Workshop Genome Inform, 12,
123-134.
Collier, N., Nobata, C. and Tsujii, J. (2000) Extracting the names of genes and gene
products with a Hidden Markov Model. Proc COLING 2000, 201-207.
Franzen, K., Eriksson, G., Olsson, F., Asker, L., Liden, P. and Coster, J. (2002) Protein names and how to find them. Int J Med Inf, 67, 49-61.
Fukuda, K., Tamura, A., Tsunoda, T. and Takagi, T. (1998) Toward information
extraction: identifying protein names from biological papers. Pac Symp Biocomput, 707-718.
Hirschman, L. (2004) Personal Communication.
Hirschman, L., Morgan, A.A. and Yeh, A.S. (2002) Rutabaga by any other name:
extracting biological names. J Biomed Inform, 35, 247-259.
Hoffmann, R. and Valencia, A. (2003) Life cycles of successful genes. Trends Genet,
19, 79-81.
Kim, J.D., Ohta, T., Tateisi, Y. and Tsujii, J. (2003) GENIA corpus-a semantically
annotated corpus for bio-textmining. Bioinformatics, 19 Suppl 1, I180-I182.
Kim, W., Aronson, A.R. and Wilbur, W.J. (2001) Automatic MeSH term assignment
and quality assessment. Proc AMIA Symp, 319-323.
Krauthammer, M., Rzhetsky, A., Morozov, P. and Friedman, C. (2000) Using BLAST
for identifying gene and protein names in journal articles. Gene, 259, 245-252.
Mika, S. and Rost, B. (2004) Protein names precisely peeled off free text. Bioinformatics, 20 Suppl 1, I241-I247.
Morgan, A., Hirschman, L., Yeh, A. and Colosimo, M. (2003) Gene Name Extraction
Using FlyBase Resources. ACL-03 Workshop on Natural Language Processing in
Biomedicine, 1-8.
Pieprzyk, J. and Sadeghiyan, B. (1993) Design of hashing algorithms. Springer-Verlag, Berlin; New York.
Proux, D., Rechenmann, F., Julliard, L., Pillet, V.V. and Jacq, B. (1998) Detecting
Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. Genome Inform Ser Workshop Genome Inform, 9, 72-80.
Pruitt, K.D., Katz, K.S., Sicotte, H. and Maglott, D.R. (2000) Introducing RefSeq and
LocusLink: curated human genome resources at the NCBI. Trends Genet, 16, 44-47.
Tsuruoka, Y. and Tsujii, J. (2003) Boosting Precision and Recall of Dictionary-Based Protein
Name Recognition. ACL-03 Workshop on Natural Language Processing in Biomedicine, 41-48.
White, J.A., McAlpine, P.J., Antonarakis, S., Cann, H., Eppig, J.T., Frazer, K., Frezal,
J., Lancet, D., Nahmias, J., Pearson, P. et al. (1997) Guidelines for human gene
nomenclature (1997). HUGO Nomenclature Committee. Genomics, 45, 468-471.
Yeh, A., Blaschke, C., Apweiler, R., Wu, C., Blake, J., Donaldson, I., Hunter, L.,
Friedman, C., Valencia, A. and Hirschman, L. (2004) BioCreAtIvE, Critical Assessment
of Information Extraction systems in biology.
http://www.pdg.cnb.uam.es/BioLINK/BioCreative.eval.html.
Yeh, A.S., Hirschman, L. and Morgan, A.A. (2003) Evaluation of text data mining for
database curation: lessons learned from the KDD Challenge Cup. Bioinformatics,
19 Suppl 1, i331-339.
Table 3. Most frequently identified verbs in gene-verb-gene patterns
Class of interaction   Verb                   Rel. frequency (%)
enzymatic              phosphorylate          2.11
enzymatic              cleave                 0.43
enzymatic              delete                 0.25
enzymatic              dephosphorylate        0.13
enzymatic              oxidize                0.07
enzymatic              ubiquitinate           0.05
enzymatic              cut                    0.04
enzymatic              catalyse               0.04
enzymatic              hyper-phosphorylate    0.03
enzymatic              hydrolyze              0.03
enzymatic              split                  0.03
enzymatic              un-methylate           0.01
enzymatic              trans-phosphorylate    0.01
enzymatic              nick                   0.00
permanent              link                   1.74
permanent              fuse                   0.61
physical               bind                   14.48
physical               interact               4.82
physical               (form) complex         4.31
physical               couple                 0.56
physical               transport              0.50
physical               stabilize              0.46
physical               co-immunoprecipitate   0.36
physical               footprint              0.27
physical               dock                   0.15
physical               co-purify              0.07
physical               attach                 0.07
physical               destabilize            0.06
physical               connect                0.05
physical               detach                 0.00
physical               co-immunopurify        0.00
positional             co-localize            0.87
positional             co-migrate             0.02
regulatory             induce                 18.92
regulatory             inhibit                8.21
regulatory             activate               7.61
regulatory             stimulate              7.31
regulatory             regulate               5.92
regulatory             enhance                4.70
regulatory             suppress               2.90
regulatory             control                2.66
regulatory             block                  2.55
regulatory             promote                1.85
regulatory             down-regulate          1.63
regulatory             repress                1.27
regulatory             influence              0.74
regulatory             co-express             0.62
regulatory             deactivate             0.31
regulatory             cross-react            0.09
regulatory             over-produce           0.04
regulatory             co-regulate            0.02
regulatory             super-induce           0.01
Table 4. Assessment of recall and precision
Columns:
(1) Gene quoting docs (RD) (gold standard: LocusLink)
(2) RD: Recall (%)
(3) Nr of gene-article references (GA) (gold standard: LocusLink)
(4) GA: Recall
(5) GA: Recall (%)
(6) Exact gene localisations (GL) (gold standard: manual)
(7) GL: True positives (gold standard: manual)
(8) GL: Precision, exact localisation (%)
(9) F-measure (%)
(10) Recall of gene-article references (gold standard: BioCreative)

Organism         (1)     (2)     (3)     (4)     (5)     (6)   (7)   (8)     (9)     (10)
H.sapiens        54460   91.53   68313   55916   81.85   403   351   87.1    84.39   -
M.musculus       23669   81.28   30415   21256   69.89   381   354   92.91   79.77   66.02
D.melanogaster   13147   85.56   28604   18596   65.01   438   409   93.38   76.66   62.78
D.rerio          586     81.06   873     620     71.02   360   352   97.78   82.28   -
C.elegans        2202    94.6    4199    3810    90.74   597   584   97.82   94.15   -
A.thaliana       -       -       -       -       -       295   271   91.86   -       -
S.cerevisiae     25489   88.59   59652   49945   83.73   536   533   99.44   90.91   85.71
E.coli           -       -       -       -       -       476   442   92.86   -       -
Average          19926   87.1    32009   25023   77      436   412   94.14   84.69   71.5
Table 5. Error sources for incorrectly identified genes
Type of error:
(1) Other gene from same organism
(2) Gene family, not individual gene
(3) Gene from related organism (e.g. human/mouse, E.coli/B.subtilis)
(4) Disease
(5) Gene from other organism (in gene dictionary)
(6) Gene from other organism (not in gene dictionary)
(7) Other biomedical concept
(8) Symbol, other

Organism         (1)   (2)   (3)   (4)   (5)   (6)   (7)   (8)
H.sapiens        7     4     10    12    2     6     7     4
M.musculus       3     1     14    2     0     1     3     3
D.melanogaster   2     4     2     0     8     0     8     5
D.rerio          1     1     2     0     2     1     1     0
C.elegans        0     0     2     0     10    0     1     0
A.thaliana       7     8     6     0     0     1     2     0
S.cerevisiae     2     1     0     0     0     0     0     0
E.coli           3     4     24    0     0     0     3     0
Total            25    23    60    14    22    9     25    12
Table 6. Reasons for false negatives
Type of problem    Synonym          Potential finding site   Organism
English word       fused toes       fused toes               Mus musculus
English word       male lethal      male lethal              Drosophila
English word       Brown            brown                    Drosophila
English word       Capricious       capricious               Drosophila
English word       Map              map                      Drosophila
English word       Yellow           yellow                   Drosophila
English word       Stoned           stoned                   Drosophila
English word       Fringe           fringe                   Drosophila
English word       Roulette         roulette                 Drosophila
English word       crippled leg     crippled leg             Drosophila
English word       Mastermind       mastermind               Drosophila
English word       Chip             chip                     Drosophila
English word       big brain        big brain                Drosophila
Too short          T1               T1                       Human
Too short          A1               A1                       Human
Substring          CRA              TCRAV2                   Human
Substring          AMHR             AMHRII                   Human
Substring          SURF 6           SURF-1 6                 Human
Substring          IGF 1            IGF-BINDING              Human
Substring          IL-1B            IL-1B+3953               Human
Substring          let 33           LETHAL 33                C. elegans
Substring          unc-82           UNC-82(E1220)IV          C. elegans
Substring          rad-2            RAD-2—ARE                C. elegans
Substring          sup-19           SUP-19(M210)V            C. elegans
Part of complex    TIS21            TIS21/PC3/BTG1/TOB       Human
Part of complex    WAF1/CIP1/SDI1   WAF1/CIP1/SDI1           Human
Initials           PKCA             PKCALPHA                 Human
Initials           TNF-RS RICH      TNF-RS RICH              Human
Initials           P85B             P85BETA                  Human
Initials           Rora             RORALPHA                 Mus musculus
Initials           Spi-9            SYSTEM PI-9              Mus musculus
Table 7. List of MeSH and Gene clusters and most significant terms
Cluster ID   M_G    Name                                Nr of abstracts   Term category         Term                            Z-Score
6368         GENE   SNRPN                               172               Biological Sciences   Genomic Imprinting              -791.94
153220       GENE   AT5g62530                           183               Biological Sciences   Gene Expression Regulation      -484.08
181          GENE   AGXT                                129               Diseases              Hyperoxaluria                   -477.15
157592       GENE   oxyR                                162               Biological Sciences   Oxidative Stress                -434.4
142368       GENE   NPH1                                21                Biological Sciences   Phototropism                    -429.53
…            …
21457        MeSH   Thoracic Cavity                     14                Anatomy               Abdominal Cavity                -27.83
21418        MeSH   Thoracic Wall                       65                Anatomy               Abdominal Wall                  -32.69
12813        MeSH   Giant Cells, Foreign-Body           132               Diagnostic, Therapy   Absorbable Implants             -27.1
18631        MeSH   Diagnostic Techniques, Urological   72                Diagnostic, Therapy   Absorbable Implants             -17.53
17534        MeSH   Tin Polyphosphates                  190               Chemicals and Drugs   Acidulated Phosphate Fluoride   -9.46
…            …
List truncated. The complete list (18703 lines) is available at http://www.pdg.cnb.uam.es/supplement/gene_navigation/index.html.