Haiyan Huang
May 12, 2010
Credits of some slides:
Atul Butte, Stanford U
Russ Altman, Stanford U
• DNA level
– gene annotation, functional elements identification
• mRNA level
– transcription regulation, gene coexpression, pathways
• Protein level
– protein structure/function; protein-protein interaction
• Systems Biology
– complex interactions in biological systems
• Translational bioinformatics
– integrative information retrieval for aiding disease diagnosis
Central Dogma of Biology
• (by Russ Altman, MD, PhD) Using the toolkit of bioinformatics to understand the relationship of molecular information to diseases and symptoms, in order to improve diagnosis, prognosis, and therapy.
( by Atul Butte, MD, PhD )
• Translational bioinformatics
– Development of analytic, storage, and interpretive methods
– Optimize the transformation of increasingly voluminous genomic and biological data into diagnostics and therapeutics for the clinician
(Research on the development of novel techniques for the integration of biological and clinical data)
• End product of translational bioinformatics
– Newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders, including biomedical scientists, clinicians, and patients
• There is an increasing call for translational medicine:
Universities, Congress, NIH, and elsewhere: “What did we get for our money?”
• Incredible amounts of publicly-available data
– GenBank: Hundreds of organisms have been completely sequenced
– GEO and ArrayExpress hold numerous samples from thousands of experiments
– NCBI dbGaP covers the interaction of genotype and phenotype, with studies that include genome-wide association studies, medical sequencing, and associations between genotype and non-clinical traits
• Using high-throughput genomic measurements for improved diagnosis/prognosis/therapy
– New classifications of disease based on molecular markers
– Identify new drug targets based on molecular profiling of disease
• Understanding disease pathology and genetic pathways in complex multigenic disorders
– Create systems for physician decision support using genetic information
Transforming Public Gene Expression Repositories into a Disease Diagnosis Database
Reference:
Huang H, Liu C, Zhou XJ (2010). Bayesian Approach to Transforming Public Gene Expression Repositories into Disease Diagnosis Databases. Proc Natl Acad Sci USA. 2010 Apr 13;107(15):6823-8.
Transforming public expression repositories into a disease diagnosis database
• Public microarray data grow about 1.5-fold per year
– NCBI Gene Expression Omnibus (GEO): > 330,000 experiments
– EBI ArrayExpress: > 115,000 experiments
• The data cover the molecular basis of a wide range of diseases: heart disease, mental illness, infectious disease, and a wide variety of cancers.
Joint work with X.J. Zhou (USC), J.C. Liu (USC)
• An unprecedented opportunity to study human diseases
• Expression-based diagnosis would be particularly useful when the potential disease is not obvious or when the disease lacks biochemical diagnostic tests
• Thus far, no effective method is available for this purpose. Existing approaches have been
– of limited scale, i.e., within single laboratories,
– targeting specific types of disease,
– lacking integration of heterogeneous datasets (i.e., from different experimental sources, with diverse phenotypes, and with information in different formats).
Integrating public repositories involves “combining the expression data and the text information”
– Expression data: microarray data
– Phenotype information: phenotype concepts (e.g., diseases, perturbations, tissues) in the Unified Medical Language System (UMLS)
• The gene expression data from different laboratories cannot be compared directly due to platform differences and systematic variation
• The disease and phenotype annotations of datasets are heterogeneous and embedded in text, and thus not in a workable format
• The disease diagnosis approach must robustly characterize a query expression profile by jointly utilizing the large amount of noisy genomic and phenotypic data
[Figure: example disease classes: Adenocarcinoma, Asthma, Glaucoma, …]
1. We initially collected 421 human microarray datasets on the platforms U95, U133, and U133 Plus 2 from NCBI GEO.
• These three major platforms share a large number of overlapping genes (8,358 genes)
2. We further selected 100 datasets (9,169 arrays) having the subset types:
• disease state
• normal, control, non-tumor, healthy, or benign
This serves as the initial database to test our framework
• Data Preparation
– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).
– Phenotypically annotating the collected human microarray experiments with Unified Medical Language System (UMLS) concepts (challenge 2).
• Bayesian disease inference (challenge 3)
• Bayesian belief network for refining the results (challenge 3)
– Log-rank-ratio: (disease profile, control profile) → standardized profile
Are the cross-dataset comparisons by log-rank-ratio meaningful?
Figure. (a) Scatterplot of expression profiles on the same samples: GSM21236 (GDS817: breast cancer cells MDA-MB-436) versus GSM21242 (GDS820: breast cancer cells MDA-MB-436); (b) scatterplot of expression rank profiles of the same samples as in (a); (c) scatterplot of expression ratio profiles GSM21240/GSM21236 (from GDS817) versus GSM21246/GSM21242 (from GDS820) in log scale; (d) scatterplot of expression rank ratio profiles of the same sample pairs as in (c) in log scale. The log rank ratio profiles of the same sample pairs are clearly comparable (versus GSM31127/GSM31117 in GDS854).
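A minimal sketch of the log-rank-ratio idea from the caption, assuming that ranks are taken within each array and that the standardized profile is the per-gene log of the disease rank over the control rank. The intensity values, the monotone distortion standing in for a lab effect, and the helper names are hypothetical and are not taken from the authors' code.

```python
import math

def ranks(values):
    """Return 1-based ranks (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            result[order[j]] = avg
        pos = end + 1
    return result

def log_rank_ratio(disease, control):
    """Per-gene log(rank in the disease array / rank in the control array)."""
    return [math.log(d / c) for d, c in zip(ranks(disease), ranks(control))]

# Hypothetical intensities for four genes; lab B measures the same samples
# on a different, strictly increasing intensity scale.
lab_a_disease = [120.0, 30.0, 900.0, 55.0]
lab_a_control = [100.0, 60.0, 300.0, 20.0]
lab_b_disease = [v ** 1.5 + 10 for v in lab_a_disease]
lab_b_control = [v ** 1.5 + 10 for v in lab_a_control]

profile_a = log_rank_ratio(lab_a_disease, lab_a_control)
profile_b = log_rank_ratio(lab_b_disease, lab_b_control)
```

Ranks do not change under a strictly increasing rescaling of an array, so `profile_a` and `profile_b` coincide here even though the raw intensities differ.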
Figure. Pearson correlations between datasets, before and after standardization.
– Red curve: different labs, same type of biological samples, same platform (Pearson correlations between GDS1372 and GDS1665)
– Blue curve: different labs, different biological samples, same platform (Pearson correlations between GDS1665 and GDS1917)
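For reference, the Pearson correlation used in this comparison can be computed as in the sketch below. The function name and the example vectors are illustrative only and are not values from the GDS datasets.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined if either input is constant

# Hypothetical standardized profiles over five genes.
profile_1 = [0.2, -1.1, 0.7, 0.0, 1.4]
profile_2 = [0.3, -0.9, 0.5, 0.1, 1.2]
similarity = pearson(profile_1, profile_2)
```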
• Data Preparation
– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).
– Phenotypically annotating the collected human microarray experiments with Unified Medical Language System (UMLS) concepts (challenge 2).
• UMLS also provides the language processing tool MetaMap to enable the automated mapping of text onto UMLS concepts
• Processing the metadata has been one of the major efforts in recent translational bioinformatics research
• Bayesian disease inference (challenge 3)
• Bayesian belief network for refining the results (challenge 3)
• UMLS consists of three major components:
– Metathesaurus: > 1 million biomedical concepts from over 100 data sources.
– Semantic Network: defines relationships between concepts.
– Lexical resources: natural language processing tools that map text onto concepts.
• Cluster synonymous terms into a single UMLS concept
• Choose the preferred term
• Assign the unique identifier
Example: synonymous terms from different source vocabularies, clustered into one UMLS concept
– Addison's disease (SNOMED CT, 363732003)
– Addison's Disease (MedlinePlus, T1233)
– Addison Disease (MeSH, D000224)
– Bronzed Disease (SNOMED Intl 1998, DB-70620)
– Deficiency; corticorenal, primary (ICPC2-ICD10, MTHU021575)
– Primary Adrenal Insufficiency (MeSH, D000224)
– Primary hypoadrenalism syndrome, Addison (MedDRA, 10036696)
UMLS concept: C0001403 Addison's disease
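The three steps (cluster synonyms, choose a preferred term, assign a unique identifier) amount to a many-to-one lookup. The sketch below is a toy in-memory version using the Addison's disease terms from the slide; it is not MetaMap or any UMLS API, and the function and variable names are hypothetical.

```python
# Preferred term for each concept unique identifier (CUI), as given on the slide.
PREFERRED = {"C0001403": "Addison's disease"}

# Synonymous terms clustered under the same CUI.
SYNONYMS = {
    "C0001403": [
        "Addison's disease",
        "Addison Disease",
        "Bronzed Disease",
        "Deficiency; corticorenal, primary",
        "Primary Adrenal Insufficiency",
        "Primary hypoadrenalism syndrome, Addison",
    ],
}

def _key(term):
    """Lower-case the term and collapse whitespace so trivial variants match."""
    return " ".join(term.lower().split())

# Reverse index: normalized term -> CUI.
INDEX = {_key(t): cui for cui, terms in SYNONYMS.items() for t in terms}

def to_concept(term):
    """Return (CUI, preferred term) for a known synonym, or None if unknown."""
    cui = INDEX.get(_key(term))
    return None if cui is None else (cui, PREFERRED[cui])
```

For example, `to_concept("bronzed  disease")` and `to_concept("Primary Adrenal Insufficiency")` both resolve to the same concept, while a term outside the table returns `None`.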
Dataset UMLS annotation construction
Example dataset: http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=563
1. Take summary in dataset
2. Take PMID
3. Download MeSH headings from PubMed by PMID as follows:
… etc.
C0242692 Skeletal muscle structure
C0027868 Neuromuscular Diseases
… etc.
(1) Nervous system disorder
(2) Neuromuscular Diseases
(3) Myopathy
(4) Musculoskeletal Diseases
(5) Congenital, Hereditary, and Neonatal Diseases and Abnormalities
(6) Genetic Diseases, Inborn
(7) Genetic Diseases, X-Linked
(8) Muscular Disorders, Atrophic
(9) Muscular Dystrophies
(10) Muscular Dystrophy, Duchenne
Table 1. The phenotype annotation set for the NCBI GEO dataset GDS563
• Aronson, A.R. (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp: 17-21.
• Processing the metadata has been one of the major efforts in recent translational bioinformatics research
– Butte AJ, Kohane IS. (2006) Creation and implications of a phenome-genome network. Nat Biotechnol. 24(1):55-62.
– Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, Musen MA. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics. 10 Suppl 2:S1.
Goal: integrate the large amount of data on various diseases into a diagnosis database, so that users can rapidly search the disease profiles for expression similarities and, from these, obtain disease annotations for the query sample of interest.
We consider our disease diagnosis question as a classification problem, where each UMLS concept represents an individual class, and all of the classes are organized in a hierarchy.
The outcome of our analysis is to categorize a standardized query dataset into several classes in the hierarchy. This general setting is known as hierarchical multilabel classification (HMC) in the machine learning field.
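A minimal sketch of how labels can be represented in an HMC setting, assuming the usual hierarchy constraint that a sample labeled with a class also carries all of its ancestor classes. The concept names come from Table 1, but the child-to-parent links below are illustrative assumptions and are not read from UMLS.

```python
# Hypothetical child -> parents links among a few concepts from Table 1.
PARENTS = {
    "Muscular Dystrophy, Duchenne": ["Muscular Dystrophies", "Genetic Diseases, X-Linked"],
    "Muscular Dystrophies": ["Muscular Disorders, Atrophic"],
    "Muscular Disorders, Atrophic": ["Musculoskeletal Diseases"],
    "Genetic Diseases, X-Linked": ["Genetic Diseases, Inborn"],
    "Genetic Diseases, Inborn": [],
    "Musculoskeletal Diseases": [],
}

CLASSES = sorted(PARENTS)  # fixed class order for the label vector

def with_ancestors(labels):
    """Return the given labels plus every ancestor reachable through PARENTS."""
    seen = set()
    stack = list(labels)
    while stack:
        concept = stack.pop()
        if concept not in seen:
            seen.add(concept)
            stack.extend(PARENTS.get(concept, []))
    return seen

def label_vector(labels):
    """Boolean multilabel vector over CLASSES that respects the hierarchy."""
    closed = with_ancestors(labels)
    return [c in closed for c in CLASSES]
```

Because the hierarchy is a directed acyclic graph rather than a tree, a class such as the Duchenne concept can have more than one parent, which is why the closure walks all parent links.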
We aim to infer $P(Q \mid s_{x,1}, \dots, s_{x,M}, e_{1,k}, \dots, e_{M,k})$.
Difficulties include:
• The association strength between different microarrays and the same disease class can vary greatly;
• The distribution of the similarity scores $s_{x,i}$ is non-standard.
To infer $P(Q \mid s_{x,1}, \dots, s_{x,M}, e_{1,k}, \dots, e_{M,k})$, we model
$$P(s_{x,1}, \dots, s_{x,M} \mid Q, e_{1,k}, \dots, e_{M,k})$$
or
$$P(s_{x,1}, \dots, s_{x,M} \mid Q, T_{1,k}, \dots, T_{M,k}) \, P(T_{1,k}, \dots, T_{M,k} \mid e_{1,k}, \dots, e_{M,k}).$$
Instead of modeling $P_{\mathrm{Alter}}$ and $P_{\mathrm{Null}}$ separately, we choose to model the ratio
$$\frac{P_{\mathrm{Alter}}(S^x_{i,1}, \dots, S^x_{i,n_i})}{P_{\mathrm{Null}}(S^x_{i,1}, \dots, S^x_{i,n_i})},$$
guided by the following properties. When all the rest are the same,
1. Larger means of scores should give larger ratios;
2. Bigger variances should give larger ratios;
3. Less skewness and kurtosis should give smaller ratios.
$$\log \frac{P_{\mathrm{Alter}}(S^x_{i,1}, \dots, S^x_{i,n_i})}{P_{\mathrm{Null}}(S^x_{i,1}, \dots, S^x_{i,n_i})} = C_0 + C_1 \cdot \mathrm{Mean}(S^x_{i,1}, \dots, S^x_{i,n_i}) + C_2 \cdot \mathrm{Var}(S^x_{i,1}, \dots, S^x_{i,n_i}) + C_3 \cdot \mathrm{Skewness}(S^x_{i,1}, \dots, S^x_{i,n_i}),$$
where
$$\mathrm{Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{m_3}{m_2^{3/2}}.$$
Note that $m_r = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^r$.
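A minimal sketch of the log-ratio model above as a linear function of summary statistics of one group's similarity scores. It assumes that Var is the second central moment $m_2$ and that Skewness is the adjusted sample skewness built from $m_2$ and $m_3$. The coefficient values and scores are hypothetical placeholders, not fitted values from the study.

```python
import math

def central_moment(xs, r):
    """r-th sample central moment m_r = (1/n) * sum((x - mean) ** r)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** r for x in xs) / n

def sample_skewness(xs):
    """Adjusted skewness sqrt(n(n-1))/(n-2) * m3 / m2**1.5; needs n >= 3 and m2 > 0."""
    n = len(xs)
    m2, m3 = central_moment(xs, 2), central_moment(xs, 3)
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

def log_ratio(scores, c0, c1, c2, c3):
    """log(P_Alter / P_Null) modeled as C0 + C1*Mean + C2*Var + C3*Skewness."""
    mean = sum(scores) / len(scores)
    variance = central_moment(scores, 2)  # assumption: Var taken as m_2
    return c0 + c1 * mean + c2 * variance + c3 * sample_skewness(scores)

# Hypothetical similarity scores between a query and one group of arrays,
# with placeholder coefficients.
group_scores = [0.10, 0.35, 0.40, 0.80]
value = log_ratio(group_scores, c0=-1.0, c1=2.0, c2=1.0, c3=0.5)
```

Modeling the ratio directly through a few moments avoids having to specify full distributions for scores that, as noted above, are not of a standard form.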
For a query profile $x$, we diagnose it with UMLS concept $U_k$ if the posterior
$$P(Q = 1 \mid S^x_{1,1}, \dots, S^x_{1,n_1}, \dots, S^x_{M,1}, \dots, S^x_{M,n_M}, e_{1,k}, \dots, e_{M,k})$$
passes the threshold set by $\lambda$.
In our study, we set $\lambda = 4.5$.
1. The UMLS concepts A, B, C, D, …, I are organized in a directed acyclic graph (DAG), with each node corresponding to a concept.
2. Given a query profile, for each concept we use $\hat{A}, \hat{B}, \dots$ to denote the Bayesian annotation obtained in the first round.
We note that for some nodes, the first-round prediction is missing.
By associating (conditional) probabilities with the DAG, we formulate our problem as a Bayesian belief network (BBN).
We implemented a popular exact inference method, variable elimination, to infer
$$P(A, B, C, D, \dots, I \mid \hat{A}, \hat{B}, \dots, \hat{I}, \mathrm{DAG})$$
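A minimal sketch of variable elimination on a two-concept network, to make the inference step concrete. It assumes binary variables, a parent concept A with child concept B, and first-round predictions A_hat and B_hat that depend only on their own true label. All probabilities are hypothetical, and this is not the authors' implementation.

```python
from itertools import product

def make_factor(variables, fn):
    """Tabulate fn over every 0/1 assignment of `variables`."""
    variables = tuple(variables)
    table = {vals: fn(dict(zip(variables, vals)))
             for vals in product((0, 1), repeat=len(variables))}
    return variables, table

def multiply(f, g):
    """Pointwise product of two factors over the union of their variables."""
    (fv, ft), (gv, gt) = f, g
    variables = fv + tuple(v for v in gv if v not in fv)
    return make_factor(
        variables,
        lambda a: ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)])

def sum_out(f, var):
    """Marginalize `var` out of factor f."""
    fv, ft = f
    i = fv.index(var)
    table = {}
    for vals, p in ft.items():
        key = vals[:i] + vals[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return fv[:i] + fv[i + 1:], table

def restrict(f, var, val):
    """Fix var = val in factor f and drop var from its scope."""
    fv, ft = f
    if var not in fv:
        return f
    i = fv.index(var)
    table = {vals[:i] + vals[i + 1:]: p for vals, p in ft.items() if vals[i] == val}
    return fv[:i] + fv[i + 1:], table

def posterior(factors, evidence, eliminate):
    """Distribution of the one variable left after conditioning and elimination.

    `eliminate` must list every unobserved variable except the query.
    """
    for var, val in evidence.items():
        factors = [restrict(f, var, val) for f in factors]
    for var in eliminate:
        touching = [f for f in factors if var in f[0]]
        if not touching:
            continue
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f)
        factors = [f for f in factors if var not in f[0]] + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    total = sum(result[1].values())
    return {vals[0]: p / total for vals, p in result[1].items()}

def noisy_prediction(hidden, observed):
    """P(observed | hidden) with hypothetical 80% sensitivity, 90% specificity."""
    p_one = {1: 0.8, 0: 0.1}
    return make_factor(
        [hidden, observed],
        lambda a: p_one[a[hidden]] if a[observed] else 1 - p_one[a[hidden]])

prior_a = make_factor(["A"], lambda a: 0.3 if a["A"] else 0.7)
# Hierarchy constraint: child concept B cannot hold unless parent A holds.
b_given_a = make_factor(
    ["A", "B"],
    lambda a: {(1, 1): 0.5, (1, 0): 0.5, (0, 1): 0.0, (0, 0): 1.0}[(a["A"], a["B"])])
network = [prior_a, b_given_a,
           noisy_prediction("A", "A_hat"), noisy_prediction("B", "B_hat")]

# First round says "not A" but "B"; query the parent after summing out B.
post_a = posterior(network, {"A_hat": 0, "B_hat": 1}, eliminate=["B"])
```

With these made-up numbers, the negative first-round call for A is offset by the positive call for its child B, and `post_a[1]` comes back at about 0.3 rather than dropping well below the prior. This illustrates how information flows along the hierarchy.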
Case study GDS2251: Comparison of myeloid leukemia cells to normal monocytes.
• Subset 1: normal
• Subset 2: myeloid leukemia
Bayesian prediction: precision 59.1%, recall 92.9%
• Chromosome abnormality and translocation are highly related to myeloid leukemia.
• … original set of Bayesian annotations, but “Bone Marrow” should be a correct …
[Figure: graph of predicted UMLS concepts. Node labels include 1. Leukocytes, Mononuclear; 2. monocyte; 5. Myeloid Leukemia; 22. Myeloproliferative disease; 23. granulocyte; …etc.]
Our method achieved an overall accuracy of 95% (precision 82% and recall 20%).
The problem is analogous to other biological hierarchical multilabel classification problems, such as gene function prediction, where the best performance reported for mouse is a precision of 41% at a recall of 20% (L. Pena-Castillo et al., Genome Biol, 2008).
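For readers less familiar with the two measures quoted above, the sketch below computes precision and recall for one query from its predicted and true annotation sets. The concept names and set contents are hypothetical, not results from the study.

```python
def precision_recall(predicted, truth):
    """Precision and recall of a predicted label set against the true label set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical example: three predicted concepts, two of them correct,
# out of four true annotations.
predicted_concepts = {"Myeloid Leukemia", "monocyte", "granulocyte"}
true_concepts = {"Myeloid Leukemia", "monocyte", "Bone Marrow Cells", "Stem cells leukemia"}
precision, recall = precision_recall(predicted_concepts, true_concepts)
```

High precision with low recall, as in the overall figures above, means that the concepts the method does report are usually right but many true annotations go unreported.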
Further accumulation of datasets would increase the power of our method.
• Power of this type of approach will increase daily with the continuous and rapid accumulation of genomics data in the public repositories.
• Our diagnosis system is also promising in its potential to reveal unexpected disease connections, and further to construct novel phenome networks.
• Better tools for text-mining are needed!
Drug-disease connectivity map
UMLS concepts:
Bone Marrow Cells
Dysmyelopoietic Syndromes
Hematopoietic stem cells
Immunoproliferative Disorders
Leukemia, Myelocytic, Acute
Lymphoproliferative Disorders
Myeloid Leukemia
Nonlymphocytic Leukemia, Acute
Stem cells leukemia
The authors’ writing style affects the UMLS annotations.
• To further improve the method's predictive power (Ci-Ren Jiang)
– (focusing on the second stage) Applying a more thorough collaborative error correction along the disease hierarchy
– The previous Bayesian belief network model is time-consuming and allows only one-way information exchange
• Focusing on particular disease diagnoses in collaboration with specialized medical doctors/researchers, e.g., comparing bipolar disorder vs schizophrenia (Wayne Lee; Dr. Fei Wang, MD, PhD, Psychiatry, Yale University)
– To investigate the usefulness/effectiveness of high-throughput molecular information in distinguishing between bipolar disorder and schizophrenia
– To compare existing clinical predictors with predictors from microarray data
– Dr. X. Jasmine Zhou (Molecular and Computational Biology, USC)
– Dr. Jim Chun-Chih Liu (Molecular and Computational Biology, USC)
– Dr. Ming-Chih Kao (Stanford Hospital, PhD, MD)
– Dr. Ci-Ren Jiang (UCB)
– Wayne Lee (UCB)
– Dr. Fei Wang (Yale U)
• Start with independently trained classifiers (hard-margin linear SVMs) for each class, without thresholding the outputs
• Assume the aggregate classifier outputs have Gaussian distributions for positive and negative examples
• Design a Bayesian hierarchical combination scheme to allow collaborative error-correction over all nodes
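A minimal sketch of the single-node step implied by the Gaussian assumption above: turning a raw classifier output into a posterior probability of the positive class with Bayes' rule. The means, standard deviations, and prior are hypothetical, and the hierarchical combination across nodes is not reproduced here.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior_positive(output, prior, pos=(1.0, 1.0), neg=(-1.0, 1.0)):
    """P(label = 1 | classifier output) under Gaussian class-conditional outputs.

    `pos` and `neg` are (mean, std) pairs for positive and negative examples;
    the defaults are placeholders, not estimates from data.
    """
    like_pos = gaussian_pdf(output, *pos) * prior
    like_neg = gaussian_pdf(output, *neg) * (1 - prior)
    return like_pos / (like_pos + like_neg)

# An unthresholded SVM output of 2.0 with an even prior.
p = posterior_positive(2.0, prior=0.5)
```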
• Apply predictive clustering tree (PCT) to hierarchical multi-label classification
• The example labels are represented with Boolean components
• Weighted Euclidean distance is used to measure similarities
• The class weights decrease with the depth of the class in the hierarchy
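A minimal sketch of a depth-weighted Euclidean distance between Boolean label vectors, assuming class weights of the form `w0 ** depth`. The value `w0 = 0.75`, the depth convention (top-level classes at depth 1), and the example vectors are assumptions for illustration.

```python
import math

def class_weights(depths, w0=0.75):
    """Weight of each class, decreasing geometrically with its depth."""
    return [w0 ** d for d in depths]

def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two 0/1 label vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))

# Three classes: one at depth 1 and two of its children at depth 2.
weights = class_weights([1, 2, 2])
labels_1 = [1, 1, 0]
labels_2 = [1, 0, 1]
labels_3 = [0, 0, 0]

d_siblings = weighted_distance(labels_1, labels_2, weights)   # differ only at depth 2
d_top_level = weighted_distance(labels_1, labels_3, weights)  # also differ at depth 1
```

Disagreements near the top of the hierarchy cost more than disagreements between deep classes, which is the effect the depth-dependent weights are meant to produce.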
• Zafer Barutcuoglu, Robert E. Schapire, and Olga G. Troyanskaya (2006), “Hierarchical multi-label prediction of gene function,” Bioinformatics 22, 830-836
• Celine Vens, Jan Struyf, Leander Schietgat, Saso Dzeroski, and Hendrik Blockeel (2008), “Decision trees for hierarchical multi-label classification,” Machine Learning 73, 185-214
Bioinformatics is an interdisciplinary research area which may be defined as the interface between biological and mathematical (i.e., math, statistics, computing) sciences.
In bioinformatics research,
• Biological data are often high dimensional, complex and noisy, e.g.,
– NCBI GEO contains the microarray data produced by thousands of research teams (cross-platform/laboratory variations )
– Some investigations may rely upon a variety of experimental technologies (heterogeneous data types)
• The demands on statisticians are substantial
– precise understanding of the underlying biological principles
– strong analytical, modelling and data manipulation skills