[Figure: Medical Informatics (diseases) and Bioinformatics (genes, anatomy, physiology) combine to yield novel relationships and deeper insights into diseases.]
Integrative Genomics For
Understanding Disease Process
Anil Jegga
Division of Biomedical Informatics,
Cincinnati Children’s Hospital Medical Center (CCHMC)
Department of Pediatrics, University of Cincinnati
Cincinnati, Ohio - 45229
Anil.Jegga@cchmc.org
Acknowledgement
• Jing Chen
• Mrunal Deshmukh
• Sivakumar Gowrisankar
• Chandra Gudivada
• Arvind Muthukrishnan
• Bruce J Aronow
Two Separate Worlds….. With Some Data Exchange…
Disease World (Medical Informatics)
• Sources: Disease Database, Patient Records, Clinical Trials
• Described by:
→Name
→Synonyms
→Related/Similar Diseases
→Subtypes
→Etiology
→Predisposing Causes
→Pathogenesis
→Molecular Basis
→Population Genetics
→Clinical findings
→System(s) involved
→Lesions
→Diagnosis
→Prognosis
→Treatment
→Clinical Trials……
Bioinformatics
• Genome, Regulome, Transcriptome, Proteome, Interactome, Metabolome, Variome, Pharmacogenome, Physiome, Pathome
• 354 “omes” so far……… and there is “UNKNOME” too: genes with no function known
http://omics.org/index.php/Alphabetically_ordered_list_of_omics (as on October 15, 2006)
Shared resources: PubMed, OMIM (Clinical Synopsis)
Motivation
To correlate diseases with anatomical parts
affected, the genes/proteins involved, and
the underlying physiological processes
(interactions, pathways, processes). In other
words, bringing the disciplines of Medical
Informatics (MI) and BioInformatics (BI)
together (Biomedical Informatics - BMI) to
support personalized or “tailor-made”
medicine.
How can multiple types of genome-scale data be integrated
across experiments and phenotypes in order to find genes
associated with diseases?
Model Organism Databases: Common Issues
• Heterogeneous Data Sets - Data Integration
– From Genotype to Phenotype
– Experimental and Consensus Views
• Incorporation of Large Datasets
– Whole genome annotation pipelines
– Large scale mutagenesis/variation projects (dbSNP)
• Computational vs. Literature-based Data
Collection and Evaluation (MedLine)
• Data Mining
– extraction of new knowledge
– testable hypotheses (Hypothesis Generation)
Support Complex Queries
• Get me all genes involved in brain development
that are expressed in the Central Nervous
System.
• Get me all genes involved in brain development in
human and mouse that also show iron ion binding
activity.
• For this set of genes, what aspects of function
and/or cellular localization do they share?
• For this set of genes, what mutations are
reported to cause pathological conditions?
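The first two queries above can be sketched as filters over annotation records. This is only an illustration of the shape of such a query, not the interface of any real database: the `GeneRecord` fields, the `find_genes` helper and the three placeholder records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical, simplified annotation record for one gene."""
    symbol: str
    species: str
    processes: frozenset
    functions: frozenset
    expressed_in: frozenset


# Placeholder records: symbols and annotations are invented for illustration.
GENES = [
    GeneRecord("GENE_A", "human", frozenset({"brain development"}),
               frozenset({"iron ion binding"}),
               frozenset({"central nervous system"})),
    GeneRecord("GENE_B", "mouse", frozenset({"brain development"}),
               frozenset({"ATPase activity"}),
               frozenset({"central nervous system"})),
    GeneRecord("GENE_C", "human", frozenset({"purine metabolism"}),
               frozenset({"iron ion binding"}),
               frozenset({"liver"})),
]


def find_genes(genes, process=None, function=None, tissue=None, species=None):
    """Return symbols of genes matching every criterion that is not None."""
    hits = []
    for gene in genes:
        if process is not None and process not in gene.processes:
            continue
        if function is not None and function not in gene.functions:
            continue
        if tissue is not None and tissue not in gene.expressed_in:
            continue
        if species is not None and gene.species not in species:
            continue
        hits.append(gene.symbol)
    return hits


# Query 1: brain development genes expressed in the central nervous system.
cns_brain_genes = find_genes(GENES, process="brain development",
                             tissue="central nervous system")

# Query 2: brain development genes in human or mouse with iron ion binding.
iron_brain_genes = find_genes(GENES, process="brain development",
                              function="iron ion binding",
                              species={"human", "mouse"})
```

Each criterion corresponds to a different data source (process and function annotations, expression data, species), which is why such queries need the sources to be integrated first.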
Bioinformatic Data-1978 to present
• DNA sequence
• Gene expression
• Protein expression
• Protein Structure
• Genome mapping
• SNPs & Mutations
• Metabolic networks
• Regulatory networks
• Trait mapping
• Gene function analysis
• Scientific literature
• and others………..
Human Genome Project – Data Deluge
NCBI Human Genome Statistics – as on October 18, 2006
Database name       Records
Nucleotide          11,512,792
Protein             313,099
Structure           8,490
Genome Sequences    51
Popset              20,801
SNP                 12,702,095
3D Domains          31,862
Domains             25
GEO Datasets        2,969
GEO Expressions     9,783,946
UniGene             86,804
UniSTS              322,092
PubMed Central      3,140
HomoloGene          20,123
Taxonomy            1
No. of Human Gene Records currently in NCBI: 31507 (excluding pseudogenes, mitochondrial genes and obsolete records). Includes ~460 microRNAs.
The Gene Expression Data Deluge
Till 2000: 413 papers on microarray!
Year     PubMed Articles
2001     834
2002     1557
2003     2421
2004     3508
2005     4400
2006-    4083+
Problems Deluge!
Allison DB, Cui X, Page GP,
Sabripour M. 2006. Microarray
data analysis: from disarray to
consolidation and consensus.
Nat Rev Genet. 7(1): 55-65.
Information Deluge…..
• 3 scientific journals in 1750
• Now - >120,000 scientific journals!
• >500,000 medical articles/year
• >4,000,000 scientific articles/year
• >16 million abstracts in PubMed derived from >32,500 journals
• >4.5 billion distinct web pages indexed by Google!
Google Search for integrative genomics: ~930,000 hits
“integrative genomics”: ~112,000 hits
A researcher would have to scan 130 different journals and
read 27 papers per day to follow a single disease, such as
breast cancer (Baasiri et al., 1999 Oncogene 18: 7958-7965).
Data-driven Problems…..
What’s in a name!
Rose is a rose is a rose is a rose!
Gene Nomenclature
• Accelerin
• Antiquitin
• Bang Senseless
• Bride of Sevenless
• Christmas Factor
• Cockeye
• Crack
• Dickie’s small eye
• Draculin
• Fidgetin
• Gleeful
• Knobhead
• Lunatic Fringe
• Mortalin
• Orphanin
• Profilactin
• Sonic Hedgehog
1. Generally, the names refer to some feature of the mutant phenotype.
2. Dickie’s small eye (Theiler et al., 1978, Anat Embryol (Berl), 155: 81-86) is now Pax6.
3. Gleeful: "This gene encodes a C2H2 zinc finger transcription factor with high sequence similarity to vertebrate Gli proteins, so we have named the gene gleeful (Gfl)." (Furlong et al., 2001, Science 293: 1632)
Disease names
• Mobius Syndrome with Poland’s Anomaly
• Werner’s syndrome
• Down’s syndrome
• Angelman’s syndrome
• Creutzfeldt-Jakob disease
• How to name or describe proteins, genes, drugs, diseases and conditions consistently and coherently?
• How to ascribe and name a function, process or location consistently?
• How to describe interactions, partners, reactions and complexes?
Some Solutions
• Develop/Use controlled or restricted vocabularies (IUPAC-like naming conventions, HGNC, MGI, UMLS, etc.)
• Create/Use thesauruses, central repositories or synonym lists (MeSH, UMLS, etc.)
• Work towards synoptic reporting and structured abstracting
Some more ambiguous examples……..
• The yeast homologue of the human gene PMS1,
which codes for a DNA repair protein, is called
PMS2; whereas yeast PMS1 corresponds to
human PMS2!
• Even more confusing, 4,257 abbreviated names
were used to refer to more than one gene. Top
of the list was MT1, used to describe at least 11
members of a cluster of genes encoding small
proteins that bind to metal ions (Nature: 411: 631-632).
and there are some weird ones too……..
• AR*E: aryl sulfatase E in all species
• f**K: fuculokinase gene in bacteria
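A synonym list of the kind mentioned under "Some Solutions" can be sketched as a lookup table that maps aliases to one official symbol and refuses aliases known to be ambiguous. The table below is a hypothetical, hand-made stand-in for resources such as HGNC or UMLS; only the p53/TP53, Dickie's small eye/Pax6 and MT1 examples come from these slides, and the `normalize` helper is an assumption made for illustration.

```python
class AmbiguousNameError(ValueError):
    """Raised when an alias is known to refer to more than one gene."""


# Hypothetical synonym list: lower-cased alias -> official symbol.
SYNONYMS = {
    "p53": "TP53",
    "tp53": "TP53",
    "dickie's small eye": "PAX6",
    "pax6": "PAX6",
}

# Aliases a lookup table alone cannot resolve (MT1 is the example in the text).
AMBIGUOUS = {"mt1"}


def normalize(name):
    """Map a free-text gene name to an official symbol, or None if unknown."""
    # Fold typographic apostrophes, case and repeated whitespace.
    key = " ".join(name.replace("\u2019", "'").lower().split())
    if key in AMBIGUOUS:
        raise AmbiguousNameError(f"{name!r} refers to more than one gene")
    return SYNONYMS.get(key)
```

Raising an error for ambiguous aliases, rather than picking one gene silently, keeps the terminology problem visible to the user instead of hiding it in the results.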
Rose is a rose is a rose is a rose….. Not Really!
What is a cell?
• any small compartment
• (biology) the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in higher plants and animals
• a device that delivers an electric current as a result of chemical reaction
• a small unit serving as part of or as the nucleus of a larger political movement
• cellular telephone: a hand-held mobile radiotelephone for use in an area divided into small sections, each with its own short-range transmitter/receiver
• small room in which a monk or nun lives
• a room where a prisoner is kept
Image Sources: Somewhere from the internet…
Semantic Groups, Types and Concepts:
• Semantic Group Biology – Semantic Type Cell
• Semantic Groups Object OR Devices – Semantic Types Manufactured Device or Electrical Device or Communication Device
• Semantic Group Organization – Semantic Type Political Group
Foundation Model Explorer
The REAL Problems
CTNNB1
1. COLORECTAL CANCER [3-BP DEL, SER45DEL]
2. COLORECTAL CANCER [SER33TYR]
3. PILOMATRICOMA, SOMATIC [SER33TYR]
4. HEPATOBLASTOMA, SOMATIC [THR41ALA]
5. DESMOID TUMOR, SOMATIC [THR41ALA]
6. PILOMATRICOMA, SOMATIC [ASP32GLY]
7. OVARIAN CARCINOMA, ENDOMETRIOID TYPE, SOMATIC [SER37CYS]
8. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PHE]
9. HEPATOCELLULAR CARCINOMA SOMATIC [SER45PRO]
10. MEDULLOBLASTOMA, SOMATIC [SER33PHE]
MET
TP53*
1. HEPATOCELLULAR CARCINOMA SOMATIC [ARG249SER]
Hepatocellular Carcinoma: CTNNB1, MET, TP53
Many disease states are complex, because of many genes (alleles & ethnicity, gene families, etc.), environmental effects (life style, exposure, etc.) and the interactions.
Environmental Effects
* Aflatoxin B1, a mycotoxin, induces a very specific G-to-T mutation at codon 249 in the tumor suppressor gene p53.
The REAL Problems
CTNNB1 pathways
1. ALK in cardiac myocytes
2. Cell to Cell Adhesion Signaling
3. Inactivation of Gsk3 by AKT causes accumulation of b-catenin in Alveolar Macrophages
4. Multi-step Regulation of Transcription by Pitx2
5. Presenilin action in Notch and Wnt signaling
6. Trefoil Factors Initiate Mucosal Healing
7. WNT Signaling Pathway
MET pathways
1. CBL mediated ligand-induced downregulation of EGF receptors
2. Signaling of Hepatocyte Growth Factor Receptor
TP53 pathways
1. Estrogen-responsive protein Efp controls cell cycle and breast tumors growth
2. ATM Signaling Pathway
3. BTG family proteins and cell cycle regulation
4. Cell Cycle
5. RB Tumor Suppressor/Checkpoint Signaling in response to DNA damage
6. Regulation of transcriptional activity by PML
7. Regulation of cell cycle progression by Plk3
8. Hypoxia and p53 in the Cardiovascular system
9. p53 Signaling Pathway
10. Apoptotic Signaling in Response to DNA Damage
11. Role of BRCA1, BRCA2 and ATR in Cancer Susceptibility
….Many More…..
HEPATOCELLULAR CARCINOMA (CTNNB1, MET, TP53)
LIVER:
• Hepatocellular carcinoma;
• Micronodular cirrhosis;
• Subacute progressive viral hepatitis
NEOPLASIA:
• Primary liver cancer
Integrative Genomics - what is it?
Another buzzword or a meaningful concept useful for
biomedical research?
Acquisition, Integration, Curation, and
Analysis of biological data
Hypothesis
Integrative Genomics: the study of complex interactions
between genes, organism and environment, the triple helix
of biology. Gene <–> Organism <-> Environment
It is definitely beyond the buzzword stage - Universities
now have programs named 'Integrated Genomics.'
Information is not knowledge - Albert Einstein
Methods for Integration
1. Link driven federations
• Explicit links between databanks.
2. Warehousing
• Data is downloaded, filtered, integrated
and stored in a warehouse. Answers to
queries are taken from the warehouse.
3. Others….. Semantic Web, etc………
Link-driven Federations
1. Creates explicit links between databanks
2. query: get interesting results and use web
links to reach related data in other
databanks
Examples: NCBI-Entrez, SRS
http://www.ncbi.nlm.nih.gov/Database/datamodel/
Querying Entrez-Gene
No. of Records
Database name   Query= p53   Query= TP53 (HGNC)   Query= p53 OR TP53
PubMed          37,962       1928                 38,512
PMC             9647         373                  9738
Book            710          332                  744
Nucleotide      7062         1603                 8442
Protein         3882         314                  3970
Genome          12           0                    12
OMIM            317          79                   744
SNP             14,277       1513                 14,779
Gene            1058         258                  1115
Homologene      723          31                   735
GEO Profiles    68,000       10,539               70,718
Cancer Chr      292          129                  421
Link-driven Federations
1. Advantages
• complex queries
• Fast
2. Disadvantages
• require good knowledge
• syntax based
• terminology problem not solved
Data Warehousing
Data is downloaded, filtered, integrated and
stored in a warehouse. Answers to queries are
taken from the warehouse.
Advantages
1. Good for very-specific, task-based queries and studies.
2. Since it is custom-built and usually expert-curated, relatively less error-prone.
Disadvantages
1. Can become quickly outdated – needs constant updates.
2. Limited functionality – e.g., one disease-based or one system-based.
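The warehousing idea can be sketched with an in-memory SQLite database: data from two sources is loaded once, and queries are then answered locally with a join. This is a minimal sketch under stated assumptions, not a description of any existing warehouse: the table and column names are hypothetical, and the rows are a small subset of the gene, disease and pathway names shown on the earlier slides.

```python
import sqlite3


def build_warehouse():
    """Load two (hypothetical) source extracts into one local database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene_disease (gene TEXT, disease TEXT);
        CREATE TABLE gene_pathway (gene TEXT, pathway TEXT);
    """)
    conn.executemany("INSERT INTO gene_disease VALUES (?, ?)", [
        ("CTNNB1", "Hepatocellular carcinoma"),
        ("MET", "Hepatocellular carcinoma"),
        ("TP53", "Hepatocellular carcinoma"),
    ])
    conn.executemany("INSERT INTO gene_pathway VALUES (?, ?)", [
        ("CTNNB1", "WNT Signaling Pathway"),
        ("MET", "Signaling of Hepatocyte Growth Factor Receptor"),
        ("TP53", "p53 Signaling Pathway"),
        ("TP53", "ATM Signaling Pathway"),
    ])
    return conn


def pathways_for_disease(conn, disease):
    """Answer a disease-to-pathway query entirely from the warehouse."""
    rows = conn.execute(
        """
        SELECT DISTINCT p.pathway
        FROM gene_disease AS d
        JOIN gene_pathway AS p ON p.gene = d.gene
        WHERE d.disease = ?
        ORDER BY p.pathway
        """,
        (disease,),
    )
    return [row[0] for row in rows]


warehouse = build_warehouse()
hcc_pathways = pathways_for_disease(warehouse, "Hepatocellular carcinoma")
```

The join is fast and needs no network access, but the answer is only as current as the last load, which is the "quickly outdated" disadvantage listed above.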
No Integrative Genomics is
Complete without Ontologies
Gene World
• Gene Ontology
(GO)
Biomedical World
• Unified Medical
Language System
(UMLS)
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are
carbohydrate binding and ATPase activity
– What a product ‘does’, precise activity
• Biological Process = biological goal or objective
– broad biological goals, such as DNA repair or purine metabolism,
that are accomplished by ordered assemblies of molecular
functions
– Biological objective, accomplished via one or more ordered assemblies of
functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular
complexes; examples include nucleus, telomere, and RNA
polymerase II holoenzyme
– ‘is located in’ (‘is a subcomponent of’ )
http://www.geneontology.org
Example: Gene Product = hammer
Function (what) | Process (why)
Drive a nail - into wood | Carpentry
Drive stake - into soil | Gardening
Smash a bug | Pest Control
A performer’s juggling object | Entertainment
GO term associations: Evidence Codes
• ISS: Inferred from sequence or structural
similarity
• IDA: Inferred from direct assay
• IPI: Inferred from physical interaction
• TAS: Traceable author statement
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IEP: Inferred from expression pattern
• ND: no data available
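Evidence codes make it possible to keep only annotations backed by a chosen kind of evidence. The sketch below uses the codes listed above; the annotation tuples are invented placeholders, and grouping IDA, IPI, IMP, IGI and IEP as "experimental" follows the usual GO convention but is stated here as an assumption of the example.

```python
# Evidence codes as listed on the slide.
EVIDENCE_CODES = {
    "ISS": "Inferred from sequence or structural similarity",
    "IDA": "Inferred from direct assay",
    "IPI": "Inferred from physical interaction",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IEP": "Inferred from expression pattern",
    "ND": "No data available",
}

# Assumption for this example: codes treated as experimental evidence.
EXPERIMENTAL_CODES = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Placeholder annotations: (gene, GO term, evidence code). Genes are invented.
ANNOTATIONS = [
    ("GENE_A", "DNA repair", "IDA"),
    ("GENE_A", "nucleus", "ISS"),
    ("GENE_B", "purine metabolism", "TAS"),
    ("GENE_C", "ATPase activity", "IMP"),
    ("GENE_D", "carbohydrate binding", "ND"),
]


def filter_by_evidence(annotations, allowed_codes):
    """Keep annotations whose evidence code is in allowed_codes."""
    unknown = set(allowed_codes) - set(EVIDENCE_CODES)
    if unknown:
        raise ValueError(f"unknown evidence codes: {sorted(unknown)}")
    return [a for a in annotations if a[2] in allowed_codes]


experimental_only = filter_by_evidence(ANNOTATIONS, EXPERIMENTAL_CODES)
```

Restricting an analysis to experimental codes trades coverage for reliability, since similarity-based annotations such as ISS are dropped.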
What can researchers do with GO?
• Access gene product functional information
• Find how much of a proteome is involved in a process/function/component in the cell
• Map GO terms and incorporate manual annotations into own databases
• Provide a link between biological knowledge and
– gene expression profiles
– proteomics data
And how?
• Getting the GO and
GO_Association Files
• Data Mining
– My Favorite Gene
– By GO
– By Sequence
• Analysis of Data
– Clustering by
function/process
• Other Tools
http://www.geneontology.org/
Open biomedical ontologies
http://obo.sourceforge.net/
Unified Medical Language System Knowledge
Server– UMLSKS
http://umlsks.nlm.nih.gov/kss/
• The UMLS Metathesaurus contains information about biomedical
concepts and terms from many controlled vocabularies and
classifications used in patient records, administrative health data,
bibliographic and full-text databases, and expert systems.
• The Semantic Network, through its semantic types, provides a
consistent categorization of all concepts represented in the UMLS
Metathesaurus. The links between the semantic types provide the
structure for the Network and represent important relationships in
the biomedical domain.
• The SPECIALIST Lexicon is an English language lexicon with many
biomedical terms, containing syntactic, morphological, and
orthographic information for each term or word.
Unified Medical Language System Metathesaurus
• Over 1 million biomedical concepts.
• About 5 million concept names from more than 100 controlled vocabularies and classifications (some in multiple languages) used in patient records, administrative health data, bibliographic and full-text databases and expert systems.
• The Metathesaurus is organized by concept or meaning. Alternate names for the same concept (synonyms, lexical variants, and translations) are linked together.
• Each Metathesaurus concept has attributes that help to define its meaning, e.g., the semantic type(s) or categories to which it belongs, its position in the hierarchical contexts from various source vocabularies, and, for many concepts, a definition.
• Customizable: Users can exclude vocabularies that are not relevant for specific purposes or not licensed for use in their institutions. MetamorphoSys, the multi-platform Java install and customization program distributed with the UMLS resources, helps users to generate pre-defined or custom subsets of the Metathesaurus.
• Uses:
– linking between different clinical or biomedical vocabularies
– information retrieval from databases with human assigned subject index terms
and from free-text information sources
– linking patient records to related information in bibliographic, full-text, or factual
databases
– natural language processing and automated indexing research
UMLSKS – Semantic Network
• Complexity reduced by grouping concepts according to the
semantic types that have been assigned to them.
• There are currently 15 semantic groups that provide a partition of
the UMLS Metathesaurus for 99.5% of the concepts.
Semantic Groups (15) → Semantic Types (135) → Concepts (millions)
One example semantic type per group (group abbreviation|group name|type ID|type name):
ACTI|Activities & Behaviors|T053|Behavior
ANAT|Anatomy|T024|Tissue
CHEM|Chemicals & Drugs|T195|Antibiotic
CONC|Concepts & Ideas|T170|Intellectual Product
DEVI|Devices|T074|Medical Device
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
GEOG|Geographic Areas|T083|Geographic Area
LIVB|Living Beings|T005|Virus
OBJC|Objects|T073|Manufactured Object
OCCU|Occupations|T091|Biomedical Occupation or Discipline
ORGA|Organizations|T093|Health Care Related Organization
PHEN|Phenomena|T038|Biologic Function
PHYS|Physiology|T040|Organism Function
PROC|Procedures|T061|Therapeutic or Preventive Procedure
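The pipe-delimited semantic group lines are straightforward to load into a lookup structure. The sketch below parses four of the lines shown on this slide; the function names are hypothetical and the four-field layout (group abbreviation, group name, type ID, type name) is assumed from those lines.

```python
from collections import defaultdict

# Four of the lines shown on the slide, copied verbatim.
SEMANTIC_GROUP_LINES = """\
ANAT|Anatomy|T024|Tissue
CHEM|Chemicals & Drugs|T195|Antibiotic
DISO|Disorders|T047|Disease or Syndrome
GENE|Genes & Molecular Sequences|T085|Molecular Sequence
"""


def parse_semantic_groups(text):
    """Return {group abbreviation: [(type ID, type name), ...]}."""
    groups = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        abbrev, _group_name, type_id, type_name = line.split("|")
        groups[abbrev].append((type_id, type_name))
    return dict(groups)


def group_of_type(text, type_id):
    """Return the group abbreviation a semantic type ID belongs to, or None."""
    for abbrev, types in parse_semantic_groups(text).items():
        if any(tid == type_id for tid, _name in types):
            return abbrev
    return None


groups = parse_semantic_groups(SEMANTIC_GROUP_LINES)
```

With such a mapping, every concept tagged with a semantic type can be rolled up to one of the 15 groups, which is how the grouping reduces complexity.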
UMLSKS – Semantic Navigator
Alzheimer’s Disease – Alarming Statistics
• The number of patients with AD in any community depends on the proportion of older people in the group. Traditionally, the developed countries had large proportions of elderly people, and so they had very many cases of Alzheimer’s disease in the community at one time.
• 4.5 million AD patients in the United States today.
• Expected to increase to 11 to 16 million by 2050.
• In 2000, health care costs for AD patients in the United States totaled approximately $31.9 billion, which is expected to reach $49.3 billion by 2010 (http://www.alz.org)
• World-wide: ~18 million (projected to nearly double by 2025 to 34 million).
• Demographic transition - Developing countries:
– Increased life expectancy (current life expectancy in India is >60 years).
– 1991 India Census: 70 million people were over 60 years.
– 2001 India Census: 77 million, or 7.6% of the population.
– By 2025, we will have 177 million elderly people.
• Currently, more than 50% of people with Alzheimer’s disease live in developing countries and by 2025, this will be over 70%.
Source: WHO & NIA
Alzheimer’s Disease – Why Computational Approaches?
• The goal of applying computational data-mining approaches is to extract useful information from large amounts of data by employing mathematical methods that should be as automated as possible.
• Computational data-mining approaches are particularly appropriate in areas with much data but few explanations, such as gerontology. If researchers can find or derive patterns in the data, that information may enhance our knowledge of aging.
• The complexity and broad range of cellular and biochemical events lead researchers to believe that there must be a sophisticated network of AD signal transduction, gene regulation, and protein-protein interaction events.
• Therefore, deciphering AD-related molecular network “circuitry” can help researchers understand AD better, model details, and propose treatment ideas.
A simplistic picture
[Figure: Alzheimer Disease linked to anatomy (Brain and Nervous System, Brain, Cerebrum, Cerebral Cortex, Frontal Lobe, Temporal Lobe, Hippocampus, Basal Nucleus of Meynert), cell types (Neurons, Astrocytes, Microglia) and genes (A2M, APOE, ALOX12, ABCA1, ABCA2, NME1, NEF3, PARK2, STH, APP).]
Many Diseases – Many Genes
[Figure: the same anatomy, cell type and gene network, now shared by three diseases: Alzheimer Disease, Parkinson Disease and Schizophrenia. Additional genes shown: PARK3, PARK7, PARP, SCZD2, SCZD3, SCZD8.]
Genes: Functions & Pathways
[Figure: the Alzheimer Disease network (anatomy, cell types, and genes A2M, APOE, ALOX12, ABCA1, ABCA2, NME1, NEF3, PARK2, STH, APP) annotated with functions/processes and pathways.]
Functions/Processes
→enzyme binding
→extracellular space
→interleukin-1 binding
→interleukin-8 binding
→intracellular protein transport
→protein carrier activity
→protein homooligomerization
→serine-type endopeptidase inhibitor activity
→tumor necrosis factor binding
→wide-spectrum protease inhibitor activity
Pathways
→Alzheimer's disease (KEGG)
→Neurodegenerative Disorders (KEGG)
→Deregulation of CDK5 in Alzheimer's Disease (BioCarta)
→Generation of amyloid b-peptide by PS1 (BioCarta)
→Platelet Amyloid Precursor Protein Pathway (BioCarta)
→Hemostasis (Reactome)
Protein Interactions
[Figure: the Alzheimer Disease network (anatomy, cell types, and genes A2M, APOE, ALOX12, ABCA1, ABCA2, NME1, NEF3, PARK2, STH, APP) extended with interacting proteins: C1QBP, KLKB1, KNG1, NS5A, CNTF, APPBP1, TGFB2.]
Understanding the genetic network of human Alzheimer’s
disease - Two general phases
1. Identifying the genetic players involved
2. Systematically perturbing individual players and/or pathways suspected of being involved in neurodegenerative diseases in model organisms (e.g. knock-outs)
Computational Approaches
• Data-mining (Data marts): Comparative Genomics, Interactome, Comparative Phenomics, Regulomics (TFBSs, motif/pattern search)
• Text-mining: Literature mining (hypothesis generator)
• Mathematical Modeling: Disease process modeling
Experimental Approaches
• Genetic Manipulations
• Gene Expression Studies
• Animal Models
• Cellular Studies (to investigate specific cellular processes)
[Figure: sources contributing to the set of Alzheimer Disease Related Genes: Cellular Studies; Gene Expression (Clustering Algorithms, Differentially expressed genes); Model Organisms & Genetic Manipulations (Models of human neurodegenerative diseases); Comparative Genomics; Genomics; Transcriptome; Proteomics; Transcriptional Regulation; Post-Transcriptional Regulation - MicroRNAs; Text-mining: Knowledge Discovery.]
NCBI Entrez Gene Query:
(alzheimer[Disease/Phenotype] OR alzheimer[All fields]) AND "homo sapiens"[Organism]
143 Genes
A2M
CD40
FAS
ABCA1
MME
RABGAP1L
CDC2
FASLG
ABCA2
MPO
RTN4
CDK5
FRAP1
ABCB1
MRE11A
SERPINA3
CDK5R1
FYN
ABL1
MSI1
SFRS12
CDK5R2
GABBR1
ACE
MTRR
SLC1A2
CHAT
GAL
AD5
NACA
SLC6A3
CHRNA4
GAPDH
AD6
NCAM1
SLC6A4
CHRNA7
GFAP
AD7
NCSTN
SNCB
CLU
GRIA1
AD8
NDRG2
SORL1
COL18A1
GRIA2
AD9
NES
TFAM
COL25A1
GRIA3
ADAM10
NGFR
TGFB1
COX10
GRIN2A
AGER
NME1
TNF
CRH
GRIN2B
AHSG
NME2
TUBB3
CTCF
GSK3B
APBA1
NOS3
UBQLN1
CTNNA3
HADH2
APBB2
NRG1
VSNL1
CTSB
HPCAL1
APH1A
OLR1
CTSD
HTR2A
APOC1
P18SRP
CXCR3
IDE
APOD
PARK7
CYP46A1
IFNG
APOE
PAXIP1
DHCR24
IGF2R
APOM
PCSK1
DLST
IL1B
APP
PCSK2
DSCR1
ITM2B
ASAHL
PCSK9
E2F1
KCNC4
ATF2
PIN1
EEF2
KLK10
BACE1
PLAU
EEF2K
KLK7
BACE2
PON1
EIF2AK2
LAMA1
BAX
PRDX1
EIF4E
LAMC1
BCHE
PRDX2
EIF4EBP1
LOC644264
BCL2
PRDX3
ENO1
LRP8
BCL2L2
PRNP
ERBB4
MAP2K1
BLMH
PSEN1
ESR1
MAPT
CBS
PSEN2
FALZ
MEOX2
RPS3A
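The Entrez Gene query above can also be issued programmatically through the NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and performs no network call; whether the slide's result was obtained this way is not stated, the `build_esearch_url` helper and the `retmax` value are assumptions, and a query run today should not be expected to return the same 143 genes.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Public E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query string exactly as shown on the slide.
AD_QUERY = ('(alzheimer[Disease/Phenotype] OR alzheimer[All fields]) '
            'AND "homo sapiens"[Organism]')


def build_esearch_url(term, db="gene", retmax=200):
    """Build an esearch URL; fetching and parsing it is left to the caller."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = build_esearch_url(AD_QUERY)
# Decode the URL again to confirm the query survived encoding unchanged.
decoded = parse_qs(urlparse(url).query)
```

Building the URL separately from fetching it keeps the query text reviewable and makes the step easy to test without contacting NCBI.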
Mining Interactome
Pathways (top 10)
Molecular & Cellular
Functions (top 10)
Physiological System
Development & Function
(top 10)
Y-axis represents significance - probability that the genes
within the dataset file are involved in a particular high level
function (Ingenuity Analysis)
NCBI Entrez Gene Query:
(alzheimer[Disease/Phenotype] OR alzheimer[All fields]) AND "homo sapiens"[Organism]
143 AD-associated genes
Mining about 800 gene expression datasets
http://depts.washington.edu/l2l/
Text-mining MedLine Abstracts
• Data Source: GeneRIF – Gene reference
into function – Manually entered/curated
sentences.
• GeneRIF: “Abstract of Abstracts”
• NLP - MetaMap and GATE (General
Architecture for Text Engineering)
• Keywords: MeSH and UMLS concepts for
Alzheimer’s disease (AD, Alzheimer’s
dementia, Alzheimer disease, etc.)
299 unique genes associated with Alzheimer’s disease
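The keyword step can be approximated with a plain pattern match over GeneRIF-style sentences. This is a deliberately simplified stand-in, not the MetaMap/GATE pipeline named above: it matches a single word stem rather than MeSH or UMLS concepts, the record layout and function name are assumptions, and the last record is an invented negative example.

```python
import re

# Matches "Alzheimer", "Alzheimer's", "alzheimer disease", and so on.
AD_PATTERN = re.compile(r"\balzheimer", re.IGNORECASE)

# (Entrez Gene ID, symbol, sentence, PubMed ID). The first two rows are taken
# from the GATACA table in this deck; the last row is a made-up placeholder.
GENERIFS = [
    (2, "A2M",
     "Genetic association of alpha2-macroglobulin polymorphisms with "
     "Alzheimer's disease", 12221172),
    (239, "ALOX12",
     "12/15-lipoxygenase is increased in Alzheimer's disease and has a "
     "possible role in brain oxidative stress", 15111312),
    (0, "GENE_X", "Placeholder sentence that does not mention the disease.", 0),
]


def genes_mentioning(records, pattern):
    """Return the sorted, de-duplicated symbols whose sentence matches."""
    return sorted({symbol
                   for _gene_id, symbol, sentence, _pmid in records
                   if pattern.search(sentence)})


ad_genes = genes_mentioning(GENERIFS, AD_PATTERN)
```

A stem match like this misses abbreviations such as "AD" and cannot tell a positive association from a negated one, which is why concept-level tools are used in practice.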
GATACA – Gene Association To Anatomy & Clinical Abnormality
299 genes associated with Alzheimer's Disease (based on text-mining Medline abstracts)
Entrez GENE ID | GENE SYMBOL | SENTENCE | PubMed_ID
2 | A2M | Genetic association of alpha2-macroglobulin polymorphisms with Alzheimer's disease | 12221172
5243 | ABCB1 | Deposition of Alzheimer beta amyloid is inversely correlated with expression of this protein in the brains of elderly non-demented humans. | 12360104
153 | ADRB1 | Single-nucleotide polymorphisms (SNPs) in the beta1-adrenergic receptor (ADRB1) allelic frequencies were analyzed in Alzheimer's disease. The combination of G protein beta3 subunit and ADRB1 polymorphisms produces AD susceptibility. | 15212839
239 | ALOX12 | 12/15-lipoxygenase is increased in Alzheimer's disease and has a possible role in brain oxidative stress | 15111312
246 | ALOX15 | 12/15-lipoxygenase is increased in Alzheimer's disease and has a possible role in brain oxidative stress | 15111312
9546 | APBA3 | Associated with etiological mechanism of Alzheimer's disease. | 11831025
299 genes associated with Alzheimer’s disease: Comparison with genes differentially expressed in Alzheimer’s and ageing frontal cortex
List | PMID | description | total probes | expected | actual | bin prob
alzheimers_disease_dn | 14769913* | Downregulated in correlation with overt Alzheimer's Disease, in the CA1 region of the hippocampus | 1222 | 11.08886 | 49 | 2.83E-17
alzheimers_disease_up | 14769913* | Upregulated in correlation with overt Alzheimer's Disease, in the CA1 region of the hippocampus | 1665 | 15.1088 | 53 | 1.82E-14
ageing_brain_up | 15190254** | Age-upregulated in the human frontal cortex | 252 | 2.286737 | 19 | 3.67E-12
ageing_brain_dn | 15190254** | Age-downregulated in the human frontal cortex | 145 | 1.315781 | 13 | 1.07E-09
* Blalock EM, Geddes JW, Chen KC, Porter NM, Markesbery WR, Landfield PW. 2004. Incipient Alzheimer's disease: microarray correlation analyses reveal major transcriptional and tumor suppressor responses. Proc Natl Acad Sci U S A. 101 (7): 2173-2178.
** Lu T, Pan Y, Kao SY, Li C, Kohane I, Chan J, Yankner BA. 2004. Gene regulation and DNA damage in the ageing human brain. Nature 429(6994): 883-891.
http://depts.washington.edu/l2l/
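The "expected", "actual" and "bin prob" columns suggest a binomial over-representation test: given a list of n probes and a per-probe chance p of matching one of the 299 genes, how likely are at least k matches? The sketch below computes that upper tail, assuming "bin prob" is such a tail probability and estimating p as expected/n; it is not claimed to reproduce the tool's exact values.

```python
import math


def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space to avoid overflow."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for i in range(max(k, 0), n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return total


# First table row: 1222 probes, 11.08886 expected matches, 49 actual matches.
n_probes, expected, actual = 1222, 11.08886, 49
tail = binomial_upper_tail(n_probes, actual, expected / n_probes)
```

All four rows of the table share the same expected/total ratio (about 0.0091), which is consistent with a single background probability being applied to every list.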
CNS-overexpressed genes in adult human and/or mouse
[Figure: comparison of genes overexpressed in Human CNS vs. Human non-CNS and in Mouse CNS vs. Mouse non-CNS]
A: 940 gene ortholog pairs overexpressed in both human and mouse CNS
B: 206 gene ortholog pairs overexpressed in human, not mouse CNS
C: 266 gene ortholog pairs overexpressed in mouse, not human CNS
Kong and Jegga, unpublished
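[Figure: Venn diagram of three gene sets]
• 299 genes associated with Alzheimer’s disease – Literature mining
• 1222 genes downregulated in Alzheimer’s
• 940 human-mouse orthologous genes overexpressed in CNS
Region counts: 220 (literature mining only), 865 (downregulated only), 581 (CNS-overexpressed only), 28 (literature mining and downregulated), 30 (literature mining and CNS-overexpressed), 308 (downregulated and CNS-overexpressed), 21 (all three sets)
Genes in all three sets (21): APP, ARPP-19, CAMK2A, CDK5, CDK5R1, CHGA, CKB, GLUL, GNAS, GRIA3, KNS2, MAP2K1, MAPK1, MAPK8IP1, PCSK1, PRDX2, RGS4, SNCA, UCHL1, VSNL1, YWHAZ
How many of these are involved in CNS development or function? – From GO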
http://concise-scanner.cchmc.org
To identify putative gene targets of transcription factors
[Figure: GenomeTrafac view of human and mouse sequence context, showing the list of transcription factor binding sites, GenomeTrafac coordinates, genome assembly coordinates, conserved binding sites between human and mouse, and GenomeTrafac tracks.]
• PDEF is an ETS transcription factor expressed in prostate epithelial cells.
• Nkx3.1 interacts with SPDEF or Prostate derived Ets factor.
[Figure: Gnf Expression Atlas (Human): Prostate; Trachea & bronchial epithelial cells.]
http://polydoms.cchmc.org
Goals – Summary………
• Enable discovery of novel disease-gene
relationships
• Facilitate discovery of disease-pathway
relationships
• Enable discovery of novel pathways and
targets and associate them with disease
processes
• Help researchers generate testable
hypotheses
• Support efforts to prioritize research
• Facilitate meta-analyses
New/Future Directions…….
Computational
• Semantic Web (SW): “A vision for the next generation web in which data from multiple sources described with rich semantics are integrated to enable processing by humans as well as software agents” (SW Life Sciences)
• Semantic Web Languages
– RDF (Resource Description Framework)
– RDFS (RDF Schema)
– OWL (Web Ontology Language)
– SPARQL (Semantic Web query language)
• Prioritization and ranking of entities in novel gene networks, and inferencing
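The RDF data model behind these languages is a set of subject-predicate-object triples, and a SPARQL query is essentially a pattern over them. The sketch below imitates that idea with plain Python tuples; it uses no RDF library, the predicate names are invented for illustration, and the few triples are loosely based on entities named earlier in this deck.

```python
# (subject, predicate, object) triples; predicate names are hypothetical.
TRIPLES = [
    ("APP", "associated_with", "Alzheimer Disease"),
    ("APOE", "associated_with", "Alzheimer Disease"),
    ("APP", "participates_in", "Platelet Amyloid Precursor Protein Pathway"),
    ("Alzheimer Disease", "affects", "Hippocampus"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]


def pathways_for_disease(triples, disease):
    """Two-step join: disease -> associated genes -> their pathways."""
    genes = {s for s, _p, _o in match(triples, predicate="associated_with",
                                      obj=disease)}
    return sorted({o for s, _p, o in match(triples, predicate="participates_in")
                   if s in genes})


ad_pathways = pathways_for_disease(TRIPLES, "Alzheimer Disease")
```

Because every source is reduced to the same triple shape, adding a new data source means adding triples rather than redesigning a schema, which is the appeal for integration.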
Biological/Genomics
• Gene regulation by microRNAs (miRNAs):
– ~22-nucleotide non-coding RNAs that primarily act post-transcriptionally by suppressing mRNAs
– At least 1% of the transcripts in the genome code for miRNAs
– miRNAs have at least 20-30% of the coding genes as their targets
– miRNAs are implicated in various cellular processes, such as cell fate determination, cell death, and tumorigenesis (Bartel 2004).
– E.g.: CREB-regulated miRNA regulates neuronal morphogenesis (Vo et al 2005)
Take-home messages
• Networks and integration of databases are keys to success in Bioinformatics.
• Combining computation and data integration into a single cohesive whole will increase the efficiency of research effort
– by reducing the serendipity and hit-and-miss nature of empirical research, and
– by giving biomedical researchers valuable clues for their choice of experiments, given limitations of funds, manpower and time.
• Researchers/Users have to know what is available, how to access and use the resources they are offered, and what the limitations are.
the Ultimate Goal…….
Disease World (Medical Informatics)
• Sources: Disease Database, Patient Records, Clinical Trials
• Described by:
→Name
→Synonyms
→Related/Similar Diseases
→Subtypes
→Etiology
→Predisposing Causes
→Pathogenesis
→Molecular Basis
→Population Genetics
→Clinical findings
→System(s) involved
→Lesions
→Diagnosis
→Prognosis
→Treatment
→Clinical Trials……
Bioinformatics
• Genome, Regulome, Transcriptome, Proteome, Interactome, Metabolome, Physiome, Pathome, Variome, Pharmacogenome
Shared resources: OMIM, PubMed
Integrative Genomics - Biomedical Informatics
Personalized Medicine
►Decision Support System
►Outcome Predictor
►Course Predictor
►Diagnostic Test Selector
►Clinical Trials Design
►Hypothesis Generator…..
http://anil.cchmc.org
(under presentations)
“To him who devotes his life to science, nothing can give more happiness than increasing the number of discoveries, but his cup of joy is full when the results of his studies immediately find practical applications”
- Louis Pasteur
Thank You!
http://sbw.kgi.edu/