Warren-Proposal-Mining-TF-Genes-Disease-2007-07

Thesis Proposal (Warren Cheung)
106739332
Page 1 of 17
To think About:
Improve Figures
Coordinate terminology (relationship/association/link)





Other Evidence sources
Gene -> PubMed Articles -> Disease
Gene -> Orthologous Gene -> PubMed Articles -> Disease
Gene -> Interacting/Regulated Gene -> PubMed Articles -> Disease
Gene -> Structural Element -> PubMed Articles
Gene -> Function -> Related Genes -> PubMed Articles -> Disease
Better word choice?
Redefine Evidence as Property? Entity == ?
Linkage/Association/Relationship/…
Extraction and Evaluation of Transcription Factor Gene-Disease Association
Thesis Proposal for Doctor of Philosophy
Warren Cheung
Supervised by
Francis Ouellette
Wyeth Wasserman
Table of Contents
A. Problem Statement
Summary of Goals
Motivation
Example Use Cases
Existing Methods
B. Proposed Method
Genes
Disease
Features
Linkages
Quantitative Evaluation
Validation
C. Goals
1) Main TF Gene-Disease Association Prediction Model
2) TF Gene-Disease Association Property Predictions
3) Gene Cluster-Disease Association Predictions
Common Goals
D. Project
Principles
E. Appendix I - Data Sources
Genes
Disease
Evidence
Prototype Implementation
F. References
A. Problem Statement
The purpose of this research will be to identify effective methods to quantitatively
evaluate the relationship between transcription factor genes and diseases via literature
evidence, identifying existing associations and predicting novel associations. To
accomplish this, I shall explore ways to link various forms of evidence with genes and
diseases, investigate quantitative methods to evaluate the resulting associations, and
validate the resulting analyses.
Summary of Goals
1. Main TF Gene-Disease Association Prediction Model
Evaluate quantitatively the association between each transcription factor gene and
each disease.
2. TF Gene-Disease Association Property Predictions
Evaluate quantitatively what properties are relevant to a gene-disease pairing.
3. Gene Cluster-Disease Association Predictions
Identify clusters of similar genes associated to disease.
Motivation
Transcription factors are regulators of gene expression, acting via the
recruitment of other transcription initiation factors as well as by causing DNA
conformational change. They can also act as part of protein complexes. Brain diseases
are a broad disease area, encompassing a wide range of complex, abnormal phenotypes,
including combinations of lethality, neurodegeneration, paralysis and behavioural
abnormality. Many diseases are not very well understood or well characterised, and
many have complex genetic components involving multiple genes.
Transcription factors in particular play a key role in the brain. Given the
incredible diversity of the neuronal and glial cells and their complex arrangement, the
careful balance of transcription factors is vital to the proper development of the brain,
determination of the cell subtype and migration. This relationship continues through to
the adult brain, where transcription factor activity is linked to neuronal survival,
differentiation, proper cellular function and neuroplasticity.
Existing databases and analyses can be leveraged as sources of information to use
with my research. For example, databases such as JASPAR, ORegAnno, TF-Cat and
PAZAR can provide information on transcription factors, via evidence of transcription
factor activity (TF-Cat) as well as interaction with DNA binding sites (PAZAR). Data
from the Pleiades project, studying region-specific promoters in the mouse brain, could
potentially be used to validate results from this thesis, linking gene promoter elements
with expression in specific brain regions.
Example Use Cases
A researcher performs a microarray experiment, comparing expression of genes
from tissue in individuals with and without symptoms of a neurological disease. From
the set of genes showing differential expression between these two conditions, the user is
looking for the set of genes most likely to be involved in diseases of interest and
supporting evidence for such relationships.
A researcher wishes to get a ranked list of known and candidate genes and
properties relevant to a particular disease, each with a list of supporting evidence,
highlighting potential pathways and regulatory relationships between these genes.
Existing Methods
Existing methods concentrate on analyzing sets of candidate genes and either
reducing or ranking the genes in the set. Methods use a variety of input data sources,
from numerical features derived from the raw DNA and protein sequences, annotations of
proteins and genes, to text mining PubMed abstracts and OMIM articles. The current
methods focus on using properties from a representative set of genes to identify similar
genes from the candidate set. A collection of these methods was applied together towards
identifying genes responsible for diabetes and obesity (Tiffin, Adie et al. 2006).
CAESAR (Gaulton, Mohlke et al. 2007), Endeavor (Aerts, Lambrechts et al. 2006) and an
update to G2D (Perez-Iratxeta, Bork et al. 2007) are more recent developments in this
field.
[Summary Table of Methods used by the related work]
One method for identifying disease-related genes involved clustering the diseases
in OMIM, rather than the disease genes, using indices such as primary tissue involved,
age of onset, primary etiology, episodic occurrence and their mode of inheritance.
Similarity between two diseases is given by the weighted contributions of each of these indices.
Once the clusters are determined (using a strategy that involves manual thresholding by a
human expert), the candidate genes are compared to the disease genes underlying the
diseases in each cluster using the annotations from GOA. The score for a candidate gene
for a disease cluster is the average, over all GO terms, of the ratio of occurrences of the
GO term in the cluster, if it matches the candidate gene (otherwise 0), and the
occurrences of the GO term in all disease genes. This score is then downscaled by the the
number of genes in the cluster. They validate their results using leave-one-out crossvalidation.
One method to tackle the general problem of identifying pertinent genes is to
narrow the relevant genes via specific constraints, with the output being results that
satisfy some or all of these constraints. GeneSeeker (Van Driel, Cuelenaere et al. 2005) can
find genes within a chromosomal location that are localized in particular tissues, by
looking at human and mouse expression data. Another method of associating disease
genes to anatomical locations (Tiffin, Kelso et al. 2005) performed text mining of
PubMed abstracts to associate eVOC anatomical ontology terms to disease gene names.
Another method is to treat the problem as a machine learning problem, and use
the representative set of genes as training data. In DGP (López-Bigas and Ouzounis
2004), this technique is used to find features common to disease genes in general, using a
decision tree classifier trained on sample disease and control proteins. Features were protein
length, as well as BLASTP ratios (conservation score) between a protein and its highest
scoring homologue within taxonomic groups (representing phylogenetic conservation and
extent) and the conservation score with the closest paralogue. Their analysis indicates
that, on average, hereditary disease genes (genes taken from OMIM) are longer, more
conserved, phylogenetically extended and without close paralogues.
PROSPECTR (Adie, Adams et al. 2005) uses a wider variety of features, including
the length of the gene, the length of its coding sequence, the length of its cDNA, the length
of the protein, GC content and percentage protein identity with its nearest homologue in
various species (mouse, worm, fly). They use an alternating decision tree, again taking
genes from OMIM and comparing against genes not found in OMIM. They also
generated two independent test sets: one using genes from the Human Gene Mutation
Database with randomly selected control genes, and another set of 54 genes not in
OMIM, but known to be involved in oligogenic disorders, again with a set of randomly
selected control genes.
POCUS (Turner, Clutterbuck et al. 2003) takes another approach to identifying
disease-related genes. The input in this case is not all disease-related genes, but rather a
selected training set of genes (from differing susceptibility regions) that are
representative of the disease in question. POCUS will then look for common features
between the training genes (InterPro domains, GO annotations, similar expression
profiles) and compare these against the probability that such common features would
occur by chance. This method assumes that genes related to the disease are more likely
to share functional annotation than would be expected by chance.
G2D (Perez-Iratxeta, Bork et al. 2002) links genes from a specified genomic locus
to diseases by examining PubMed MeSH disease and chemical term annotation and
RefSeq GO annotations. MeSH disease terms were mapped to MeSH chemical terms via
co-occurring annotation of PubMed articles. RefSeq GO annotations were linked to the
MeSH chemical terms via the PubMed references in the GO annotations. Fuzzy set
relation scores were generated for these pairwise associations as the ratio of the
cardinality of the intersection against the union. The score for the combined
disease-chemical-gene relation is defined as the product of the two pairwise relations, and the
score for a disease-gene relation is simply the maximum of all possible scores. Recently,
it has been developed into a web server (Perez-Iratxeta, Wjst et al. 2005), and the most
recent update (Perez-Iratxeta, Bork et al. 2007) includes several other methods of
inferring disease-gene associations, involving the user providing genes from other
genomic regions related to the disease. The first method is more stringent: it looks for
disease genes sharing functional similarity with the specified genes. The second method
looks for functional association via protein-protein interactions (provided by the STRING
database).
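The G2D scoring described above can be sketched as follows. This is a minimal illustration rather than the G2D implementation: it assumes that every term is represented by the set of PubMed article identifiers annotated with it, and the function names and the toy identifiers in the usage note are hypothetical.

```python
def fuzzy_relation(articles_a, articles_b):
    """Ratio of the size of the intersection to the size of the union."""
    union = articles_a | articles_b
    if not union:
        return 0.0
    return len(articles_a & articles_b) / len(union)


def disease_gene_score(disease_articles, chemical_articles, go_articles):
    """Maximum, over all (chemical term, GO term) pairs, of the product of
    the disease-chemical and chemical-GO pairwise relation scores.

    chemical_articles: dict mapping a chemical term to a set of article IDs
    go_articles: dict mapping a GO term of the gene to a set of article IDs
    (both representations are assumptions made for this sketch)
    """
    best = 0.0
    for chem_set in chemical_articles.values():
        disease_chem = fuzzy_relation(disease_articles, chem_set)
        for go_set in go_articles.values():
            chem_go = fuzzy_relation(chem_set, go_set)
            best = max(best, disease_chem * chem_go)
    return best
```

For example, a disease annotated on articles {1, 2, 3, 4}, a chemical term on {3, 4, 5} and a GO term on {4, 5} give pairwise scores of 2/5 and 2/3, and so a combined score of 4/15.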
The Endeavor system (Aerts, Lambrechts et al. 2006) aims to create an extensible
system for prioritizing disease genes using heterogeneous data sources. The input to the
system is a training set of genes. They evaluated the performance of the system against
monogenic diseases (automatically extracted from OMIM), polygenic diseases (six genes
recently determined to be involved in polygenic disease) and also for functional role in
regulatory pathways (by examining differential RT-PCR expression). They also
performed functional validation in zebrafish, searching for genes involved in DiGeorge
syndrome (DGS), by using a training set of genes causing DGS and DGS-like symptoms.
This resulted in the prioritization of both TBX1, a known DGS-related gene, and YPEL1,
which yielded DGS-like defects when expression was knocked down in vivo.
More recently, CAESAR (Gaulton, Mohlke et al. 2007) takes a representative
input text on the disease and uses text mining to determine relevant ontology terms. For
each data source (including GOA, InterPro and protein-protein interaction databases),
genes are ranked based on the annotated ontology terms. The gene ranks are then
integrated, using the functions sum, mean, maximum as well as a transformed score that
considers both the rank of a gene for each data source and the number of genes returned
by that data source.
B. Proposed Method
We propose a method that extracts gene-disease associations, emphasizing
verifiable supporting evidence for the predicted associations and a quantitative evaluation
of the strength of the association. We shall investigate both associations between genes
and disease, as well as properties of the gene-disease association.
[Figure: the three base entities: Genes, Diseases and Evidence]
We shall consider three base entities (Genes, Diseases and Evidence) and the
relationships between these entities. Our goal will be to predict Gene-Disease
relationships based on the existence of relationships within and between the entities,
creating paths between the Gene and Disease entities. Starting with the relationship of
shared evidence between a gene and a disease, we will also consider less direct
relationships, such as orthologous genes in other species and related diseases.
These paths of supporting evidence will be quantitatively evaluated, making it
possible to both extract strongly supported gene-disease linkages and to rank these
linkages.
Although the thesis itself will investigate properties of transcription factor genes
in diseases, the methods and analysis will be designed for general application. For the
initial analysis of the main gene-disease associations, we shall investigate brain diseases
specifically. Once we reach the stage of mining property associations and analysis of
clusters of genes, we shall select a second disease area to look at to both allow more
variety in analyses and demonstrate the generality of the method.
Genes
We shall consider all genes in Entrez Gene as our primary source for genes,
mapping references to genes, DNA, RNA and protein products as needed. We shall
identify transcription factors via GOA, supplemented by genes in a TF-specific database,
TF-Cat. The Ensembl Gene set may be mapped to, combined with, or used as an
alternative to Entrez Gene.
Disease
We shall use the disease terms in the MeSH ontology as a primary source for
disease terms. Other vocabularies/ontologies, such as the UMLS Metathesaurus
concepts, Disease Ontology, ICD and SNOMED CT, may also be used in conjunction with or
in place of the MeSH ontology.
Features
In general, features encompass all descriptive properties of genes and diseases,
qualitative and quantitative. Qualitative features include ontology or vocabulary
annotations, such as from GOA for genes and MeSH terms for PubMed articles, or may
be free text, such as GeneRIFs. Quantitative features include numerical attributes, such
as the length of coding sequence, as well as derived numerical attributes, such as BLAST
similarity score to the nearest murine homologue.
Evidence
We shall consider PubMed articles as a primary source of supporting evidence.
All other forms of experimental evidence (microarray data, gene linkage studies) will be
mapped to the relevant PubMed article in order to be considered. Additionally, we can
consider properties derived from primary data sources, such as length of the protein
sequence from the NCBI and Ensembl sequence repositories.
Linkages
To find evidence associating transcription factors with diseases, we shall
integrate and evaluate the strength of the links between genes, evidence and disease. This
divides the linkages into five broad categories: Gene-Gene, Gene-Evidence,
Evidence-Evidence, Evidence-Disease and Disease-Disease.
Gene-Gene relationships include homology and gene interactions. When
considering a human transcription factor gene, information can be gleaned from paralogs,
highly similar genes potentially arising from an ancestral gene duplication event, and
orthologs in a closely related species. Gene interaction includes protein-protein
interactions as well as regulatory mechanisms, from interfering RNA to the
transcriptional regulation effects at DNA binding sites of the transcription factors. These
related genes are likely to share elements in common with the human TF under consideration.
From the presumed evolutionary relationships, paralogs are likely to share function and
orthologs are likely to perform the same role [REF basic ortholog = function], although
recently there is evidence supporting significant divergence between the mouse and
human genome-wide transcription factor-DNA binding profile (Odom, Dowell et al.
2007). Interaction partners and downstream regulated genes are likely to be involved in
some common process. The gene-gene relationships will be extracted from curated
sources such as Orthologene, and also computationally derived via commonly-used gene
similarity metrics such as BLAST E-values. Interaction databases such as BIND, IntAct
and STRING will be used to extract other protein-protein relationships. The PAZAR
database can also provide TF-gene regulation relationships.
Gene-Evidence relationships include gene references in PubMed articles:
GeneRIFs, RefSeq Related Articles and Gene Ontology literature references all cite articles
relevant to the gene of interest. As well, the transcription factor database TF-Cat links
transcription factor genes with relevant articles. Many of the links describe the reason for
the linkage, whether via plain text (GeneRIFs and TF-Cat) or ontology terms (GOA).
[FIXME quantitative scores?]
Evidence can be linked by similarity as well as via citations. PubMed Related
Articles links articles in PubMed by the similarity of their abstract text, while
citations link an article both to the articles that cite it and to those it cites.
Disease-Evidence links are taken from references to a particular disease from an
article. We shall use MeSH headings, annotated by NLM curators on PubMed articles, to
link evidence to disease.
Disease-Disease linkages can be gleaned from ontological relationships, and from
hierarchical arrangements in organized vocabularies. The MeSH hierarchy will be used to
determine relationships between disease entities.
Quantitative Evaluation
Scoring Relationships
To evaluate the results obtained, we shall aim to generate relevant and intuitive
numerical scoring methods. Our goal is for the scoring methods to be sufficiently general
to allow evaluation and comparison of varied forms of evidence
and methods.
To evaluate strength of a linkage between two entities (e.g. a TF gene and a
disease) supported by evidence (e.g. from a subset of all PubMed articles), we consider a
null hypothesis – that the linkage found occurred entirely by chance. We can therefore
examine the probability of the evidence found occurring by chance. In the example, we
consider the n PubMed articles that are referenced by the gene, and the k of these which are
annotated as linked to the disease. We then compare against the K articles that are
annotated as linked to the disease among the N articles in PubMed. If we consider each
article referenced by the gene as a random draw from the pool of all articles available (the
subset of all PubMed articles), we can use a hypergeometric distribution to model the
number of articles we would see by chance annotated as linked to the disease and
quantitatively evaluate our results. Therefore, if we observe that x articles referenced by
the gene are associated to the disease,

$$\Pr(k \ge x) = \sum_{i=x}^{n} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

These results equate to performing a one-tailed Fisher’s exact test. Should this
prove too computationally expensive or inaccurate to compute, we can approximate this
using the binomial distribution, if n is much smaller than N-K and K.
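A minimal sketch of this tail probability and of its binomial approximation, using only the Python standard library. The function names `hypergeom_tail` and `binomial_tail` are hypothetical; the symbols x, n, K and N follow the definitions in the text.

```python
from math import comb


def hypergeom_tail(x, n, K, N):
    """P(k >= x) when n articles are drawn without replacement from N
    articles, of which K are annotated as linked to the disease."""
    total = comb(N, n)
    upper = min(n, K)  # terms beyond this limit are zero
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return hits / total


def binomial_tail(x, n, p):
    """Binomial approximation of the same tail, with p = K / N."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))
```

For instance, with N = 10 articles, K = 4 linked to the disease and n = 3 referenced by the gene, observing x = 2 or more linked articles has probability (C(4,2)C(6,1) + C(4,3)C(6,0)) / C(10,3) = 40/120 = 1/3. Exact integer arithmetic keeps the hypergeometric sum accurate, at the cost of working with very large integers when N is in the millions.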
Multiple Testing Correction
Multiple testing correction will be employed in cases where we examine the
potential association between a gene and each of the diseases – for example, when the
investigator specifies a particular gene, and requests a list of all diseases associated with
the gene. The danger in such a case is a potentially increased Type I (false positive) error.
In such a case, we can employ the Bonferroni (familywise error) correction: effectively,
we divide the significance level we are looking for (e.g. α = 0.05) by the number of tests
we employed (e.g. the number of diseases we tested for) and count as significant the
p-values that fall below this conservative threshold.
When considering many tests, the penalty imposed by Bonferroni correction
may prove too extreme, resulting in a substantial increase of Type II (false negative) error. As
an alternative, we could employ Benjamini-Hochberg (false discovery rate) correction to
control the Type I error explicitly. In this case, rather than controlling for a single
erroneous rejection of the null hypothesis, we control the expected fraction of erroneous
rejections.
This method has been shown to be applicable when the tests are independent and
when the tests are positively correlated, and has been used for correction of GO term
overrepresentation.
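The two corrections can be sketched as follows. This is a minimal illustration using the standard Bonferroni threshold and the standard Benjamini-Hochberg step-up procedure; the function names are hypothetical, and each function returns one significance flag per input p-value, in the input order.

```python
def bonferroni(pvalues, alpha=0.05):
    """Flag p-values at or below alpha divided by the number of tests."""
    if not pvalues:
        return []
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up procedure: find the largest rank r such that the r-th
    smallest p-value is <= (r / m) * alpha, then flag the r smallest."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant
```

With the five p-values 0.039, 0.005, 0.13, 0.011 and 0.02 at α = 0.05, the Bonferroni threshold of 0.01 keeps only 0.005, while the step-up procedure keeps every value except 0.13.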
Joint Probability
In general, we can utilize the overrepresentation analysis to determine when two
When considering two links, linking gene A to feature B, and feature B to disease C, with
p-values p(B|A) and p(C|B), we often wish to estimate the probability of the secondary
relation, p(C|A). Assuming that the relation A->B->C is transitive, and that p(B|A) and
p(C|B) are independent, we can compute p(C|A) as the joint probability p(B|A AND
C|B). Then the probability of the combined link will be p(B|A) + p(C|B) - p(B|A)p(C|B),
which is the probability that at least one of the two links arose by chance.
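Under the stated independence assumption, the combined value equals 1 - (1 - p1)(1 - p2) for link p-values p1 and p2, which extends naturally to longer chains. A minimal sketch, in which the function name and the extension to more than two links are assumptions of this example:

```python
def combined_link_pvalue(link_pvalues):
    """Probability that at least one link in a chain arose by chance,
    assuming the links are independent: 1 - prod(1 - p_i)."""
    all_links_real = 1.0
    for p in link_pvalues:
        all_links_real *= 1.0 - p
    return 1.0 - all_links_real
```

For two links this reduces to p1 + p2 - p1*p2, so the combined value is never smaller than the weaker of the two links.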
Another heuristic, useable when we wish to examine multiple links, is the shortest
path heuristic (Zhou, Kao et al. 2002). Each link becomes the weight of an edge in a
graph, and the length of the shortest path between two points is the value given for that
relation.
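A minimal sketch of the shortest path heuristic, using Dijkstra's algorithm from the Python standard library as one possible way to find the shortest path. It assumes undirected links with non-negative weights (for example, link p-values used directly as edge lengths); the function name and the node labels in the usage note are hypothetical.

```python
import heapq


def shortest_path_length(edges, source, target):
    """Dijkstra's algorithm over an undirected graph.

    edges: iterable of (node_a, node_b, weight) with non-negative weights.
    Returns the shortest path length, or None if target is unreachable.
    """
    graph = {}
    for a, b, w in edges:
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, w in graph.get(node, []):
            candidate = dist + w
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return None
```

Given a direct route from a gene through one article to a disease (weights 0.01 and 0.02) and an indirect route through an ortholog (weights 0.05, 0.001 and 0.001), the function returns the length of the direct route, since 0.03 is smaller than 0.052.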
Validation
Validation of the data will be performed in three ways — using OMIM gene and
disease entries, comparison with more recent data and manual verification. This will test
the sensitivity of our method, by providing positive examples. As it is impossible to rule
out a future link between a gene and disease, there is no negative data.
To evaluate the basic accuracy of the relationship links suggested by the system,
OMIM entries noting a link between a gene and a disease will comprise one set
of positive data Y, to be compared against the results generated by our system, X. By
taking the ratio $|X \cap Y| / |Y|$, we can evaluate the sensitivity of our method: the
fraction of the positive examples that are correctly identified by the system. This will be the
evaluation of the predictive capability of the system. Performed using the most recent
versions of the databases, this would evaluate the ability of the system to reconstruct the
known relationships in OMIM. Similarly, we can manually evaluate the results of the
system using the associated evidence, such as determining whether the PubMed articles
referenced support the gene-disease association hypothesis.
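The sensitivity ratio can be sketched with set operations. This minimal example assumes that each association is represented as a (gene, disease) pair; the function name and the pairs in the usage note are hypothetical.

```python
def sensitivity(predicted, known_positive):
    """Fraction of known positive associations (Y) that also appear in
    the predicted associations (X): |X & Y| / |Y|."""
    if not known_positive:
        raise ValueError("need at least one known positive association")
    return len(predicted & known_positive) / len(known_positive)
```

If the system predicts three pairs and one of the two pairs in the positive set is among them, the sensitivity is 0.5. Predictions outside the positive set do not lower the score, which matches the absence of negative data noted above.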
To look at the predictive ability of the system, we can also freeze the databases
loaded by the system before a particular date and use these slightly obsolete databases for
the analysis. We can then curate the more recent literature since the frozen time-point for
novel gene-disease linkage discoveries which will provide a second set of positive data.
Methods to generate this curated dataset would be to look for new OMIM disease and
gene entries and manual curation. Manual curation can be assisted by the system by
generating relationships and verifying the evidence manually.
OMIM is a fairly conservative source of gene and disease information, and
therefore will not necessarily have all the most recent discoveries curated. One method
to more accurately place the time
We can also manually verify the evidence supplied by the system for a particular
gene-disease linkage. By examining the PubMed articles referenced, we can evaluate
whether each is relevant to the gene, the disease, both or neither. This form of verification will
evaluate the relevance of the data extracted by the system.
C. Goals
1) Main TF Gene-Disease Association Prediction Model
Associations between genes and diseases will be identified, implicating specific
genes in specific diseases with a quantitative strength.
• Tool to derive associations between genes and diseases from the database
• Model to quantitatively evaluate associations extracted
• Validation of the associations derived and the model
2) TF Gene-Disease Association Property Predictions
Associations between genes and diseases will be expanded to elucidate additional
properties, such as the functional role of the gene in the disease and the affected locations, as
well as to investigate the relationships between genes involved in the disease, such as via
protein-protein interactions and transcriptional regulation.
• Tool to analyse gene-property-disease associations
• Model to quantitatively evaluate the properties derived
• Validation of the additional properties derived and the model
3) Gene Cluster-Disease Association Predictions
Meta-analysis, using the results from previous association analyses, will focus on
finding clusters of genes related to disease. Using traditional methods, such as k-means,
as well as more recently developed methods such as (Jochen’s rank-based prior) and
OPTICS, we shall investigate whether the genes can be clustered in ways that are
meaningful for diseases and disease properties.
• Cluster genes, looking for disease and disease-property clusters
• Validation by examining known disease-related genes and disease genes involved in pathways
Common Goals
Data on genes, diseases and evidence used to support the gene-disease
associations will be extracted and stored, to support analysis and validation.
• Database of transcription factor genes, diseases and evidence data
• Tool to create and update the database from relevant data sources
D. Project
Principles
Quantitative TF Gene-Disease Relationships
The tools will allow examination of known and predicted gene-disease
relationships, and quantitatively evaluate these relationships. The evidence supporting
predictions will be accessible, allowing users direct means to confirm the predictions.
The system will be designed to accommodate more general use in other disease areas or
types of genes.
Open Access
Freely available data sources will be used. The tools developed and results of the
analyses will be made publicly available and published in open access journals.
Modular, Efficient Programmatic Framework
A comprehensive toolkit for analysis will be developed. Scalable algorithms will
be used to handle the extremely large, expanding datasets involved. Efficient methods to
extract data from the large dataset will be developed.
E. Appendix I - Data Sources
The system will be designed to provide a complete storage solution for genes,
diseases and evidence from disparate databases, as well as existing and computed
annotations and relationships. A consistent interface will allow straightforward access to
all the data. This data will be stored in a database, with programs written to both load
and update from the data sources.
Initially, we can separate our concerns into three areas — genes, diseases and
evidence. Genes refer to loci on the chromosomes of humans, generally protein-coding,
including the relevant regulatory elements. Diseases refer to abnormal human
phenotypes. Evidence refers to all the data that will be used to link genes to disease.
Due to the extreme sizes of the data sources involved (16 million entries in
PubMed alone), we shall consolidate the data in a local database. This will ensure
maximal efficiency for accessing the data when performing the analyses. As well, this
will put all the data in a common, controlled format, which will simplify the downstream
analyses and make the development of the subsequent tasks independent of the data
acquisition task.
Genes
The ultimate goal of the thesis is to link human transcription factor genes with
human diseases. However, genes in other organisms, especially the closely related and
well-studied mouse models, as well as other genes, such as genes regulated by transcription
factors, will need to be considered. As well, in existing methods, candidate genes may be
specified directly by the user or selected via broad chromosomal regions. To
accommodate the range of genes that may be used in our analyses, I shall use Entrez
Gene as the primary source reference for genes.
Entrez Gene
This NCBI database tracks genes annotated in genomes, from known genes to
protein coding regions (e.g. in viruses) and predicted genes. A unique gene
identifier is assigned for each gene in each species. Data in Entrez Gene comes from both
curated and automatically generated sources. This includes information from and links to
sequences in NCBI Reference Sequence (RefSeq). Gene Ontology (GO) annotations are
provided by the Gene Ontology Annotation (GOA) Database. Data from
Entrez-accessible sources at the NCBI can be accessed via NCBI Entrez EUtils, as well as
downloaded via FTP as compressed text files.
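As a rough illustration of programmatic access, the sketch below builds an ESearch request against the NCBI E-utilities and, in a separate function, fetches the matching identifiers. The helper names and the query in the usage note are hypothetical, and the parsing assumes the JSON layout returned when `retmode=json` is requested.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="gene", retmax=20):
    """Build an ESearch URL for the given Entrez database and query."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def esearch_ids(term, db="gene", retmax=20):
    """Run the search over the network and return the matching IDs."""
    with urlopen(esearch_url(term, db, retmax)) as response:
        result = json.load(response)
    return result["esearchresult"]["idlist"]
```

A call such as `esearch_ids("TBX1[gene] AND human[orgn]")` would return a list of Entrez Gene identifiers as strings. For bulk loading into a local database, the FTP downloads are likely to be the more practical route.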
Gene Ontology
GO is a collaborative effort to provide a consistent nomenclature for gene
annotations and for indicating the strength of the evidence supporting such annotations.
In addition to the three original members, the model organism databases FlyBase,
Saccharomyces Genome Database (SGD) and the Mouse Genome Database
(MGD), there are now over ten full members, including GOA, and several associate
members. GO is composed of three main ontologies - biological processes, cellular
components and molecular functions. Annotations are described by a three-letter
controlled vocabulary of evidence codes, from inferences by electronic annotation (IEA)
to traceable author statements (TAS). However, GO does not describe "abnormal"
features, such as mutant or disease-specific traits. The Gene Ontology Annotation
Database is responsible for annotations to proteins in the human, chicken and cow
genomes in UniProtKB, and is supplemented by annotations from other groups. Priority
is given to proteins without annotation, those with disease relevance and those relevant to
high-throughput analyses. We use the GO term “transcription factor” to identify genes
that are transcription factors.
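A minimal sketch of this selection step follows, assuming the annotations have already been loaded as (gene ID, GO term name, evidence code) tuples. The function name, the tuple layout, the keyword match and the example rows are all hypothetical; the option to exclude evidence codes such as IEA is an assumption added for illustration, not a stated part of the method.

```python
def select_tf_genes(annotations, keyword="transcription factor", excluded_codes=()):
    """Return the gene IDs annotated with a GO term whose name contains keyword.

    annotations: iterable of (gene_id, go_term_name, evidence_code) tuples.
    excluded_codes: evidence codes to ignore, e.g. {"IEA"} (an assumed option).
    """
    excluded = set(excluded_codes)
    return {
        gene_id
        for gene_id, term_name, evidence in annotations
        if keyword in term_name.lower() and evidence not in excluded
    }


# Hypothetical example rows; the gene IDs are placeholders, not real identifiers.
EXAMPLE_ANNOTATIONS = [
    (1, "transcription factor activity", "IEA"),
    (2, "transcription factor activity", "TAS"),
    (3, "protein binding", "TAS"),
]

if __name__ == "__main__":
    print(select_tf_genes(EXAMPLE_ANNOTATIONS))
    print(select_tf_genes(EXAMPLE_ANNOTATIONS, excluded_codes={"IEA"}))
```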
Statistics
(Figure: gene counts, originally plotted as a bar chart on a logarithmic axis)
All Genes: 2631524
Human Genes: 38624
TF Genes: 7866
Human TF Genes: 1209
Other sources for Transcription Factors
I shall also examine the integration of other data sources, both to increase the
coverage of transcription factors and to provide more direct links to the literature.
Curated TF databases, such as the locally developed TF-Cat database, can provide a
specialised, annotated resource for transcription factors.
Disease
No standard ontology or vocabulary for diseases is currently in widespread use.
However, several closely related standards exist for categorizing data in various fields:
Medical Subject Headings (MeSH), used to annotate PubMed articles; the International
Classification of Diseases (ICD), the standard terminology used worldwide to track
morbidity; and SNOMED CT, an emerging standard for health records. The Unified Medical
Language System Metathesaurus and the Disease Ontology will provide methods of unifying
these terminologies. [GALEN project?]
[licensing issues?]
Medical Subject Headings (MeSH)
MeSH is a controlled vocabulary thesaurus of descriptors, arranged in a
hierarchical structure. Sixteen main categories (e.g. Anatomy, Diseases) at the top are
divided into subcategories, and the descriptors are then placed into the tree, from more
general terms near the top to the most specific at the bottom. A descriptor may occur
more than once in the tree. We shall initially use category C, in particular tree
number C10.228.140, "Brain Diseases", and the terms beneath it, as labels for disease.
However, as MeSH is a general subject classification system, disease labels will often be
general rather than specific. For example, “Spinocerebellar ataxias” (SCA) exists as a
distinct MeSH term, but the specific SCA types do not.
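Because tree numbers encode the hierarchy, selecting "Brain Diseases" and everything beneath it can be sketched as a prefix test on the tree number. The code below is an illustrative sketch only: the function names are hypothetical, and apart from the root C10.228.140 every term and tree number in the example data is an invented placeholder, not a real MeSH entry.

```python
def in_subtree(tree_number, root="C10.228.140"):
    """True if tree_number is the root itself or lies beneath it in the tree."""
    # Requiring the trailing "." stops "C10.228.1401" from matching "C10.228.140".
    return tree_number == root or tree_number.startswith(root + ".")


def disease_labels(term_to_tree_numbers, root="C10.228.140"):
    """Return, sorted, the terms with at least one tree number under the root."""
    return sorted(
        term
        for term, numbers in term_to_tree_numbers.items()
        if any(in_subtree(number, root) for number in numbers)
    )


# Placeholder data: only the "Brain Diseases" root comes from the text above.
EXAMPLE_TERMS = {
    "Brain Diseases": ["C10.228.140"],
    "Hypothetical child term": ["C10.228.140.999", "C23.000.000"],
    "Hypothetical look-alike term": ["C10.228.1401"],
    "Hypothetical unrelated term": ["C04.000"],
}

if __name__ == "__main__":
    print(disease_labels(EXAMPLE_TERMS))
```

A descriptor that occurs more than once in the tree is kept if any one of its tree numbers falls under the chosen root.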
International Classification of Diseases (ICD)
Also known as the International Statistical Classification of Diseases and Related
Health Problems, this classification system, published by the World Health Organisation,
provides codes to classify diseases, and also signs of health problems such as symptoms,
social circumstances and external causes of injury. It is currently in its 10th revision,
ICD-10, and is used to track mortality statistics worldwide. ICD-10-CA is an enhanced
version developed by the Canadian Institute for Health Information for morbidity
classification, and was phased in from 2001 to 2006. ICD-9-CM, based on the 9th ICD
release, is the current official standard used by U.S. hospitals. Incorporation of ICD
variants would therefore allow interoperability with publicly available morbidity data.
Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT)
SNOMED CT was originally developed by the College of American Pathologists.
As of April 2007, it is owned by the International Health Terminology Standards
Development Organisation (IHTSDO). Canada is a founding member country of
IHTSDO, and is represented by Canada Health Infoway, an organization aimed at
providing interoperable electronic health record solutions for Canadians. [EXISTING
USES?]
The Unified Medical Language System (UMLS) Metathesaurus
The UMLS Metathesaurus is a database of medical terminology, provided by
the National Library of Medicine. It maps terms from vocabularies including MeSH,
ICD and SNOMED CT to unique concept identifiers.
Disease Ontology
This controlled vocabulary, currently under development at the Center for Genetic
Medicine, Northwestern University, aims to facilitate mapping diseases and conditions,
and uses the UMLS to map terminologies such as SNOMED and ICD into the Open
Biomedical Ontology format. The previous stable release version of the ontology was
based primarily on ICD-9-CM.
(Online) Mendelian Inheritance in Man (OMIM)
OMIM provides access to curated reports in human-readable text format on both
genes implicated in diseases and diseases with a genetic component. Articles include
inline PubMed references as supporting evidence. OMIM has been used [REFS] as a
source for genetic diseases. However, this would only provide a list of known
(potentially) genetic diseases, leaving out diseases that do not yet have a known genetic
component.
Evidence
As we are focusing research on verifiable experimental evidence, our sources of
data will include scientific articles summarizing the results of experiments in addition to
other databases of experimental results and the results of simple analyses. PubMed will
provide a basic source of scientific articles.
PubMed
PubMed is a searchable citation database at the NCBI, indexing biomedical
literature. Bibliographical citation information is taken primarily from the National
Library of Medicine (NLM) MEDLINE database, although some journals indexed for
their biomedical articles have all their articles indexed, and there are also legacy articles
from OLDMEDLINE, as well as other initiatives that experimented with indexing other
scientific literature.
GeneRIF
Gene References Into Function (GeneRIFs) are annotations, both submitted by the
public and curated by the National Library of Medicine, describing references to gene
function. Gene function in this case is defined very broadly, referring not only to
biological function, but also to information about the gene's role in disease, as well as
its discovery and mapping. In addition to general GeneRIFs, there are two other major
sources of GeneRIFs: the HIV-1, Human Protein Interaction Database, and
the protein-protein interaction databases BIND, BioGRID, EcoCyc and
HPRD. All GeneRIFs include a reference to at least one PubMed article as evidence, so
GeneRIFs associate PubMed evidence with genes.
MeSH annotations
PubMed/MEDLINE entries are continually being indexed using MeSH terms by
curators at the NLM. Each article is indexed by one or more MeSH terms, each of which
may also have one of 83 topical qualifier subheadings (e.g. analysis, education or
therapy) to potentially indicate a more specific topic.
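Since GeneRIFs tie genes to PubMed articles and MeSH indexing ties articles to disease terms, the two can be joined on the PMID to collect candidate gene-disease links. The sketch below assumes both sources have been reduced to simple pairs; the function name, the input layout and all example values are hypothetical, and co-occurrence alone is not being presented here as the proposal's scoring method.

```python
from collections import defaultdict


def gene_disease_support(generifs, mesh_annotations, disease_terms):
    """Map (gene_id, disease_term) to the set of PMIDs that link the two.

    generifs: iterable of (gene_id, pmid) pairs.
    mesh_annotations: iterable of (pmid, mesh_term) pairs.
    disease_terms: the MeSH terms accepted as disease labels.
    """
    # Index the accepted disease terms assigned to each article.
    terms_by_pmid = defaultdict(set)
    for pmid, term in mesh_annotations:
        if term in disease_terms:
            terms_by_pmid[pmid].add(term)

    # Follow each GeneRIF to its article, then to that article's disease terms.
    support = defaultdict(set)
    for gene_id, pmid in generifs:
        for term in terms_by_pmid.get(pmid, ()):
            support[(gene_id, term)].add(pmid)
    return dict(support)


# Placeholder data: gene IDs, PMIDs and term names are invented for illustration.
EXAMPLE_GENERIFS = [(101, 1), (101, 2), (202, 2), (303, 3)]
EXAMPLE_MESH = [(1, "Disease A"), (2, "Disease A"), (2, "Disease B"), (3, "Other topic")]

if __name__ == "__main__":
    links = gene_disease_support(EXAMPLE_GENERIFS, EXAMPLE_MESH, {"Disease A", "Disease B"})
    for pair, pmids in sorted(links.items()):
        print(pair, sorted(pmids))
```

Keeping the supporting PMIDs, rather than only a count, leaves each link traceable back to the articles that produced it.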
Statistics
Total Genes with GeneRIFs: 33216
Human TF Genes with GeneRIF: 914
PubMed articles: 16,120,074
PubMed articles with MeSH headings: 15,806,221
PubMed articles with Brain Disease MeSH (or more specific) terms: 660538
MeSH terms: 47143
Unique MeSH terms: 24355
MeSH terms under Brain Diseases: 312
Other forms of Evidence
The PAZAR database identifies regulatory elements associated with genes,
potentially revealing interconnected regulatory programs. The STRING database
incorporates both experimental protein-protein interaction evidence and predicted
interactions. The KEGG database provides pathway data.
Experimental evidence has also been incorporated in programs such as
GeneSeeker and POCUS. Other annotations used include eVoc annotations (Tiffin),
InterPro domains, and secondary properties derived from DNA or protein sequences (DGP,
PROSPECTR). In addition, some programs use text mining, extracting information from
textual sources such as PubMed abstracts and OMIM articles.
Prototype Implementation
The data will be stored in a relational database. Each of the major concepts
(genes, diseases and evidence) will be represented as an abstract entity. A specific
instance of an entity will both be a member of the abstract entity and store its specific
information separately. For efficiency, the abstract entities will contain only the data
relevant to the analyses; specific information can be looked up afterwards.
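One way such a database might be laid out is sketched below, assuming SQLite and using the table and field names that appear in the proposal's schema figure. The column types, keys and foreign-key references are assumptions made for the example, as is the helper name create_database; the proposal itself does not specify them.

```python
import sqlite3

# Table and field names follow the schema figure; types and keys are assumed.
SCHEMA = """
CREATE TABLE entrez_gene (
    entrez_gene_id INTEGER PRIMARY KEY,
    locus_name     TEXT
);
CREATE TABLE pubmed (
    pmid  INTEGER PRIMARY KEY,
    title TEXT
);
CREATE TABLE related_articles (
    pmid         INTEGER REFERENCES pubmed(pmid),
    related_pmid INTEGER,
    score        REAL
);
CREATE TABLE generif (
    entrez_gene_id INTEGER REFERENCES entrez_gene(entrez_gene_id),
    pmid           INTEGER REFERENCES pubmed(pmid),
    heading        TEXT,
    description    TEXT
);
CREATE TABLE mesh (
    term        TEXT,
    tree_number TEXT PRIMARY KEY
);
CREATE TABLE pubmed_mesh_annotations (
    pmid           INTEGER REFERENCES pubmed(pmid),
    mesh_term      TEXT,
    mesh_qualifier TEXT,
    major_topic    INTEGER
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_database()
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    print([name for (name,) in rows])
```

With this layout, a gene's candidate disease labels could be retrieved by joining generif to pubmed_mesh_annotations on pmid.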
Figure: prototype database schema (tables and their fields)
Entrez Gene: Entrez Gene ID, Locus name
Related Articles: PMID, Related_PMID, Score
GeneRIF: Entrez Gene ID, PMID, Heading, Description
PubMed: PMID, Title
MeSH: Term, Tree_number
PubMed MeSH Annotations: PMID, MeSH Term, MeSH Qualifier, Major Topic?
F. References
Adie, E., R. Adams, et al. (2005). "Speeding disease gene discovery by sequence based
candidate prioritization." BMC Bioinformatics 6(1): 55.
Aerts, S., D. Lambrechts, et al. (2006). "Gene prioritization through genomic data
fusion." Nat Biotech 24(5): 537-44.
Gaulton, K., K. Mohlke, et al. (2007). "A computational system to select candidate genes
for complex human traits." Bioinformatics.
López-Bigas, N. and C. Ouzounis (2004). "Genome-wide identification of genes likely to
be involved in human genetic disease." Nucleic Acids Research 32(10): 3108.
Odom, D., R. Dowell, et al. (2007). "Tissue-specific transcriptional regulation has
diverged significantly between human and mouse." Nat Genet 39(6): 730-2.
Perez-Iratxeta, C., P. Bork, et al. (2007). "Update of the G2D tool for prioritization of
gene candidates to inherited diseases." Nucleic Acids Research.
Perez-Iratxeta, C., P. Bork, et al. (2002). "Association of genes to genetically inherited
diseases using data mining." Nat Genet 31(3): 316-9.
Perez-Iratxeta, C., M. Wjst, et al. (2005). "G2D: a tool for mining genes associated with
disease." BMC Genetics 6(1): 45.
Tiffin, N., E. Adie, et al. (2006). "Computational disease gene identification: a concert of
methods prioritizes type 2 diabetes and obesity candidate genes." Nucleic Acids Research
34(10): 3067.
Tiffin, N., J. Kelso, et al. (2005). "Integration of text- and data-mining using ontologies
successfully selects disease gene candidates." Nucleic Acids Research 33(5): 1544-52.
Turner, F., D. Clutterbuck, et al. (2003). "POCUS: mining genomic sequence annotation
to predict disease genes." Genome Biology 4(11): 75.
Van Driel, M. A., K. Cuelenaere, et al. (2005). "GeneSeeker: extraction and integration
of human disease-related information from web-based genetic databases." Nucleic Acids
Research 33(web server): 758.
Zhou, X., M.-C. Kao, et al. (2002). "Transitive functional annotation by shortest-path
analysis of gene expression data." Proceedings of the National Academy of Sciences of
the United States of America 99(20): 12783-8.