H. Huang Research Group in Computational Biology

Translational Bioinformatics

Haiyan Huang

May 12, 2010

Credits of some slides:

Atul Butte, Stanford U

Russ Altman, Stanford U

Background: bioinformatics in the postgenomic era

• DNA level

– gene annotation, functional elements identification

• mRNA level

– transcription regulation, gene coexpression, pathways

• Protein level

– protein structure/function; protein-protein interaction

• Systems Biology

– complex interactions in biological systems

• Translational bioinformatics

– integrative information retrieval for aiding disease diagnosis

Central Dogma of Biology

What is Translational Bioinformatics?

• (by Russ Altman, MD, PhD) Using the toolkit of bioinformatics to understand the relationship of molecular information to diseases and symptoms, in order to improve diagnosis, prognosis, and therapy.

What is translational bioinformatics?

(by Atul Butte, MD, PhD)

• Translational bioinformatics

– Development of analytic, storage, and interpretive methods

– Optimize the transformation of increasingly voluminous genomic and biological data into diagnostics and therapeutics for the clinician

(Research on the development of novel techniques for the integration of biological and clinical data)

• End product of translational bioinformatics

– Newly found knowledge from these integrative efforts that can be disseminated to a variety of stakeholders, including biomedical scientists, clinicians, and patients

Why translational bioinformatics?

• There is an increasing call for translational medicine:

Universities, Congress, NIH, and elsewhere: “What did we get for our money?”

• Incredible amounts of publicly available data

– GenBank: Hundreds of organisms have been completely sequenced

– GEO and ArrayExpress have numerous samples from thousands of experiments

– NCBI dbGaP for the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, association between genotype and non-clinical traits, etc.

Example I

Example II

Example III

Translational Bioinformatics Tasks

• Using high-throughput genomic measurements for improved diagnosis/prognosis/therapy

– New classifications of disease based on molecular markers

– Identify new drug targets based on molecular profiling of disease

• Understanding disease pathology and genetic pathways in complex multigenic disorders

– Create systems for physician decision support using genetic information

Transforming Public Gene Expression Repositories into a Disease Diagnosis Database

Reference:

Huang H, Liu C, Zhou XJ (2010). Bayesian Approach to Transforming Public Gene Expression Repositories into Disease Diagnosis Databases. Proc Natl Acad Sci USA. 2010 Apr 13;107(15):6823-8.

Example project in translational bioinformatics

Transforming public expression repositories into a disease diagnosis database

• The public microarray data increase 1.5-fold per year

– NCBI Gene Expression Omnibus (GEO): > 330,000 experiments

– EBI ArrayExpress: > 115,000 experiments

• These data cover the molecular basis of a wide range of diseases: heart disease, mental illness, infectious disease, and a wide variety of cancers.

Joint work with X.J. Zhou (USC), J.C. Liu (USC)

Example project in translational bioinformatics

Transforming public expression repositories into a disease diagnosis database

• The public microarray data increase 1.5-fold per year

– NCBI Gene Expression Omnibus (GEO): > 270,000 experiments

– EBI ArrayExpress: > 115,000 experiments

• An unprecedented opportunity to study human diseases

• Expression-based-diagnosis would be particularly useful when the potential disease is not obvious or when the disease lacks biochemical diagnostic tests

Our goal: turning the NCBI GEO into an automated disease diagnosis system

• Thus far, no effective method is available for this purpose. Existing approaches have been

– of limited scale, i.e., within single laboratories,

– targeting specific types of disease,

– lacking the integration of the heterogeneous datasets (i.e., from different experimental sources, with diverse phenotypes, containing the information in different formats).

Integrating public repositories involves “combining the expression data and the text information”

Expression data

Microarray Data

Phenotype information

Phenotype Concepts (e.g., diseases, perturbations, tissues) in the Unified Medical Language System (UMLS)

Three challenges towards our goal

• The gene expression data from different laboratories cannot be compared directly due to platform differences and systematic variation

• The disease and phenotype annotations of datasets are heterogeneous and embedded in text, and thus not in a workable format

• The disease diagnosis approach must robustly characterize a query expression profile by jointly utilizing the large amount of noisy genomic and phenotypic data

Collecting microarray data sets

Adenocarcinoma Asthma Glaucoma

1. We initially collected 421 human microarray datasets on the platforms U95, U133, and U133 Plus 2 from NCBI GEO.

• These three major platforms share a large number of overlapping genes (8,358 genes)

2. We further selected 100 datasets (9,169 arrays) having the subset types:

• disease state

• normal, control, non-tumor, healthy, or benign

This serves as the initial database to test our framework

Method highlights

• Data Preparation

– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).

– Phenotypically annotating the collected human microarray experiments with Unified Medical Language System (UMLS) concepts (challenge 2).

• Bayesian disease inference (challenge 3)

• Bayesian belief network for refining the results (challenge 3)

Method highlights

• Data Preparation

– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).

[Figure: log-rank-ratio; disease profile and control profile → standardized profile]

Data Standardization: an example

Figure. (a) Scatterplot of expression profiles of the same samples: GSM21236 (GDS817: breast cancer cells MDA-MB-436) versus GSM21242 (GDS820: breast cancer cells MDA-MB-436); (b) scatterplot of expression rank profiles of the same samples as in (a); (c) scatterplot of expression ratio profiles GSM21240/GSM21236 (from GDS817) versus GSM21246/GSM21242 (from GDS820) in log scale; (d) scatterplot of expression rank-ratio profiles of the same sample pairs as in (c) in log scale. It is clear that the log-rank-ratio profiles of the same sample pairs are comparable versus GSM31127/GSM31117 (in GDS854).
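The log-rank-ratio used in panel (d) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the tie-breaking rule, the natural logarithm, and the choice of the disease rank as numerator are all assumptions, and the expression values are made up.

```python
import math

def ranks(values):
    """Return 1-based ranks (smallest value gets rank 1); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def log_rank_ratio(disease, control):
    """Gene-wise log of (rank in disease array / rank in control array)."""
    if len(disease) != len(control):
        raise ValueError("profiles must cover the same genes")
    return [math.log(d / c) for d, c in zip(ranks(disease), ranks(control))]

# Hypothetical intensities for four genes measured in one lab.
disease_lab1 = [120.0, 950.0, 310.0, 40.0]
control_lab1 = [100.0, 400.0, 600.0, 50.0]

# The same samples as another lab might report them: a different scale and
# offset, but the same ordering of genes within each array.
disease_lab2 = [2 * v + 30 for v in disease_lab1]
control_lab2 = [0.5 * v for v in control_lab1]

profile_lab1 = log_rank_ratio(disease_lab1, control_lab1)
profile_lab2 = log_rank_ratio(disease_lab2, control_lab2)
```

Because ranks depend only on the ordering within an array, the two standardized profiles coincide even though the raw intensities differ, which is the sense in which such profiles become comparable across labs.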

Standardized profiles are consistent with the phenotype annotations

Red curve: different labs, same type of biological samples, same platform (Pearson correlations between GDS1372 and GDS1665)

Blue curve: different labs, different biological samples, same platform (Pearson correlations between GDS1665 and GDS1917)

[Figure panels: before standardization; after standardization]
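The curves compare profiles through Pearson correlation. For readers who want the computation spelled out, here is a small self-contained Python sketch; the function name and the toy profiles are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Toy standardized profiles (made-up values).
profile_a = [0.0, 0.3, -0.3, 0.1]
profile_b = [0.1, 0.4, -0.2, 0.2]   # profile_a shifted by a constant
profile_c = [0.0, -0.3, 0.3, -0.1]  # profile_a with the sign flipped

same_direction = pearson(profile_a, profile_b)
opposite_direction = pearson(profile_a, profile_c)
```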

Method highlights

• Data Preparation

– Standardizing the expression data to remove cross-lab and cross-platform incompatibilities (challenge 1).

– Phenotypically annotating the collected human microarray experiments with Unified Medical Language System (UMLS) concepts (challenge 2).

• UMLS also provides the language processing tool MetaMap to enable the automated mapping of text onto UMLS concepts

• Processing the metadata has been one of the major efforts in recent Translational Bioinformatics research

• Bayesian disease inference (challenge 3)

• Bayesian belief network for refining the results (challenge 3)

Unified Medical Language System (UMLS)

• UMLS consists of three major components:

– Metathesaurus: > 1 million biomedical concepts from over 100 data sources.

– Semantic Network: defines relationships between concepts.

– Lexical resources: natural language processing tools that can map text onto the concepts.

UMLS Metathesaurus

• Cluster synonymous terms into a single UMLS concept

• Choose the preferred term

• Assign the unique identifier

Example: synonymous terms from different sources (term; source; source identifier)

Addison's disease; SNOMED CT; 363732003

Addison's Disease; MedlinePlus; T1233

Addison Disease; MeSH; D000224

Bronzed Disease; SNOMED Intl 1998; DB-70620

Deficiency; corticorenal, primary; ICPC2-ICD10; MTHU021575

Primary Adrenal Insufficiency; MeSH; D000224

Primary hypoadrenalism syndrome, Addison; MedDRA; 10036696

UMLS concept: C0001403 Addison's disease
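The clustering of synonyms under one identifier amounts to a lookup table. The following Python sketch uses only the terms and the concept identifier shown on this slide; the helper names and the simple case-insensitive normalization are assumptions and are far cruder than what the Metathesaurus or MetaMap actually does.

```python
# Terms from the slide, all clustered under one UMLS concept identifier.
CONCEPT_ID = "C0001403"
SYNONYMS = [
    "Addison's disease",
    "Addison's Disease",
    "Addison Disease",
    "Bronzed Disease",
    "Deficiency; corticorenal, primary",
    "Primary Adrenal Insufficiency",
    "Primary hypoadrenalism syndrome, Addison",
]

def normalize(term):
    """Lowercase and collapse whitespace (an illustrative normalization only)."""
    return " ".join(term.lower().split())

def build_index(terms, concept_id):
    """Map every normalized synonym to the shared concept identifier."""
    return {normalize(term): concept_id for term in terms}

def lookup(term, index):
    """Return the concept identifier for a term, or None if it is unknown."""
    return index.get(normalize(term))

INDEX = build_index(SYNONYMS, CONCEPT_ID)
```

With this index, `lookup("bronzed  disease", INDEX)` and `lookup("Addison Disease", INDEX)` return the same identifier, while a term that is not listed returns `None`.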

Dataset UMLS annotation construction

http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds_browse.cgi?gds=563

1. Take summary in dataset

2. Take PMID

3. Download MeSH headings from PubMed by PMID as follows:

… etc.

C0242692 Skeletal muscle structure

C0027868 Neuromuscular Diseases

… etc.

(1) Nervous system disorder

(2) Neuromuscular Diseases

(3) Myopathy

(4) Musculoskeletal Diseases

(5) Congenital, Hereditary, and Neonatal Diseases and Abnormalities

(6) Genetic Diseases, Inborn

(7) Genetic Diseases, X-Linked

(8) Muscular Disorders, Atrophic

(9) Muscular Dystrophies

(10) Muscular Dystrophy, Duchenne

Table 1. The phenotype annotation set for the NCBI GEO dataset GDS563

Processing the metadata

• Aronson, A.R. (2001) Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp: 17-21.

• Has been one of the major efforts in recent Translational Bioinformatics research

– Butte AJ, Kohane IS. (2006) Creation and implications of a phenome-genome network. Nat Biotechnol. 2006 Jan;24(1):55-62.

– Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, Musen MA. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics. 2009 Feb 5;10 Suppl 2:S1.

Goal: to integrate the large amount of data on various diseases into a diagnosis database, so that users can rapidly search the disease profiles for expression similarities and then retrieve disease annotations for the query sample of interest.

We treat our disease diagnosis question as a classification problem, where each UMLS concept represents an individual class and all of the classes are organized in a hierarchy.

The outcome of our analysis is to categorize a standardized query dataset into several classes in the hierarchy. This general setting is known as hierarchical multilabel classification (HMC) in the machine learning field.
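In hierarchical multilabel settings, a label set is commonly kept consistent with the hierarchy: assigning a class also assigns all of its ancestors. The Python sketch below illustrates that convention only. The parent map reuses concept names from the GDS563 annotation set, but its edges are illustrative assumptions rather than an extract of the UMLS hierarchy, and the function name is hypothetical.

```python
# Toy hierarchy: child concept -> list of parent concepts (assumed edges).
PARENTS = {
    "Muscular Dystrophy, Duchenne": ["Muscular Dystrophies", "Genetic Diseases, X-Linked"],
    "Muscular Dystrophies": ["Muscular Disorders, Atrophic"],
    "Genetic Diseases, X-Linked": ["Genetic Diseases, Inborn"],
}

def with_ancestors(labels, parents):
    """Return the label set closed under the 'has parent' relation."""
    closed = set()
    stack = list(labels)
    while stack:
        concept = stack.pop()
        if concept not in closed:
            closed.add(concept)
            # A concept may have several parents, so the hierarchy is a DAG.
            stack.extend(parents.get(concept, ()))
    return closed

annotation = with_ancestors({"Muscular Dystrophy, Duchenne"}, PARENTS)
```

Starting from the single most specific concept, the closure contains that concept together with every more general concept reachable through the toy parent map.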

Building Bayesian classifier for each disease class

We aim to infer

$$P(Q \mid s_{x,1}, \ldots, s_{x,M}, e_{1,k}, \ldots, e_{M,k}).$$

[Figure labels: database phenotype groups, up to group M]

Difficulties include:

• The association strength between different microarrays to the same disease class can vary greatly;

• The distribution of the similarity scores $s_{x,i}$ is non-standard.

To obtain $P(Q \mid s_{x,1}, \ldots, s_{x,M}, e_{1,k}, \ldots, e_{M,k})$, we model

$$P(s_{x,1}, \ldots, s_{x,M} \mid Q, e_{1,k}, \ldots, e_{M,k})$$

or

$$P(s_{x,1}, \ldots, s_{x,M} \mid Q, T_{1,k}, \ldots, T_{M,k}) \, P(T_{1,k}, \ldots, T_{M,k} \mid e_{1,k}, \ldots, e_{M,k}).$$

Log-linear regression

Instead of modeling $P_{\mathrm{Alter}}$ and $P_{\mathrm{Null}}$, we choose to model the ratio

$$\frac{P_{\mathrm{Alter}}(S^x_{i,1}, \ldots, S^x_{i,n_i})}{P_{\mathrm{Null}}(S^x_{i,1}, \ldots, S^x_{i,n_i})},$$

guided by the following properties:

When all the rest are the same,

1. Larger means of scores should give larger ratios;

2. Bigger variances should give larger ratios;

3. Less skewness and kurtosis should give smaller ratios.

$$\log \frac{P_{\mathrm{Alter}}(S^x_{i,1}, \ldots, S^x_{i,n_i})}{P_{\mathrm{Null}}(S^x_{i,1}, \ldots, S^x_{i,n_i})} = C_0 + C_1 \cdot \mathrm{Mean}(S^x_{i,1}, \ldots, S^x_{i,n_i}) + C_2 \cdot \mathrm{Var}(S^x_{i,1}, \ldots, S^x_{i,n_i}) + C_3 \cdot \mathrm{Skewness}(S^x_{i,1}, \ldots, S^x_{i,n_i}),$$

where

$$\mathrm{Skewness} = \frac{\sqrt{n(n-1)}}{n-2} \cdot \frac{m_3}{m_2^{3/2}}.$$

Note that $m_r$ denotes the $r$-th sample central moment.
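To make the log-linear form concrete, here is a minimal Python sketch that evaluates a log ratio from the mean, variance, and skewness of a group of similarity scores. The coefficient values are hypothetical placeholders (the slides do not give fitted values), the use of the unbiased sample variance is an assumption, and the function names are illustrative.

```python
import math

def central_moment(scores, r):
    """r-th sample central moment: (1/n) * sum((s - mean)^r)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((s - mean) ** r for s in scores) / n

def adjusted_skewness(scores):
    """sqrt(n(n-1)) / (n-2) * m3 / m2^(3/2); needs at least three scores."""
    n = len(scores)
    if n < 3:
        raise ValueError("skewness needs at least three scores")
    m2 = central_moment(scores, 2)
    m3 = central_moment(scores, 3)
    if m2 == 0:
        return 0.0  # all scores identical: treat skewness as zero (assumption)
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

def log_ratio(scores, coefficients):
    """C0 + C1*Mean + C2*Var + C3*Skewness for one group of scores."""
    c0, c1, c2, c3 = coefficients
    n = len(scores)
    mean = sum(scores) / n
    variance = central_moment(scores, 2) * n / (n - 1)  # unbiased (assumption)
    return c0 + c1 * mean + c2 * variance + c3 * adjusted_skewness(scores)

# Hypothetical coefficients, chosen only to exercise the formula.
COEFFICIENTS = (0.5, 1.0, 2.0, 3.0)
example = log_ratio([1.0, 2.0, 3.0], COEFFICIENTS)
```

For the symmetric scores `[1.0, 2.0, 3.0]` the skewness term vanishes, so the value is driven by the mean and variance terms alone.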

Making the annotation predictions

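For a query profile $x$, we will diagnose it with UMLS concept $U_k$ if

[Equation lost in extraction. The surviving fragments show a decision rule that conditions on the similarity scores $S^x_{1,1}, \ldots, S^x_{1,n_1}, \ldots, S^x_{M,1}, \ldots$ and the annotations $e_{1,k}, \ldots, e_{M,k}$, and that involves a threshold $\lambda$.]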

In our study, we set λ to be 4.5.

Refining the inferred annotations

1. The UMLS concepts A, B, C, D, …, I are organized in a directed acyclic graph (DAG), with each node corresponding to a concept.

2. Given a query profile, for each concept we use $\hat{A}, \hat{B}, \ldots$ to denote the Bayesian annotation obtained in the first round.

We note that for some nodes, the first-round prediction is missing.

By associating (conditional) probabilities with the DAG, we formulate our problem as a Bayesian Belief Network (BBN).

We implemented a popular exact inference method, variable elimination, to infer

$$P(A, B, C, D, \ldots, I \mid \hat{A}, \hat{B}, \ldots, \hat{I}, \mathrm{DAG}).$$
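As a toy illustration of variable elimination, consider a two-node network with a parent concept A, a child concept B, and a noisy first-round prediction attached to each. The Python sketch below sums out A to obtain the posterior of B. All probabilities are hypothetical, the network is far smaller than the concept DAG in the talk, and the function name is illustrative.

```python
def posterior_child(prior_parent, child_given_parent, parent_obs_lik, child_obs_lik):
    """P(B = 1 | first-round predictions) for the chain A -> B.

    prior_parent:        P(A = 1)
    child_given_parent:  {a: P(B = 1 | A = a)}
    parent_obs_lik:      {a: P(observed prediction for A | A = a)}
    child_obs_lik:       {b: P(observed prediction for B | B = b)}
    """
    # Eliminate A: tau(b) = sum_a P(a) * P(obs_A | a) * P(b | a)
    tau = {}
    for b in (0, 1):
        total = 0.0
        for a in (0, 1):
            p_a = prior_parent if a == 1 else 1.0 - prior_parent
            p_b1 = child_given_parent[a]
            p_b = p_b1 if b == 1 else 1.0 - p_b1
            total += p_a * parent_obs_lik[a] * p_b
        tau[b] = total
    # Multiply in the evidence on B and normalize.
    unnormalized = {b: tau[b] * child_obs_lik[b] for b in (0, 1)}
    return unnormalized[1] / (unnormalized[0] + unnormalized[1])

# Hypothetical numbers. P(B = 1 | A = 0) = 0 encodes the hierarchy constraint
# that a child concept cannot hold when its parent does not.
posterior = posterior_child(
    prior_parent=0.3,
    child_given_parent={0: 0.0, 1: 0.5},
    parent_obs_lik={0: 0.1, 1: 0.8},
    child_obs_lik={0: 0.2, 1: 0.7},
)
```

Summing out the parent first keeps the computation local to each factor, which is the idea that variable elimination applies node by node on larger graphs.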

Validation Results

Diagnosed Results for GDS563

(1) Nervous system disorder

(2) Neuromuscular Diseases

(3) Myopathy

(4) Musculoskeletal Diseases

(5) Congenital, Hereditary, and Neonatal Diseases and Abnormalities

(6) Genetic Diseases, Inborn

(7) Genetic Diseases, X-Linked

(8) Muscular Disorders, Atrophic

(9) Muscular Dystrophies

(10) Muscular Dystrophy, Duchenne

Table 1. The phenotype annotation set for the NCBI GEO dataset GDS563

Case study GDS2251: Comparison of myeloid leukemia cells to normal monocytes.

• Subset 1: normal

• Subset 2: myeloid leukemia

Bayesian prediction: precision 59.1%, recall 92.9%

Chromosome abnormality and translocation are highly related to myeloid leukemia.

… original set of Bayesian annotations but “Bone Marrow” should be a correct …

[Figure: hierarchy of numbered UMLS concepts, including:]

1. Leukocytes, Mononuclear

2. monocyte

5. Myeloid Leukemia

22. Myeloproliferative disease

23. granulocyte

…etc.

Results: Leave-One-out Cross Validation

Our method achieved an overall accuracy of 95% (precision 82% and recall 20%).

The problem is analogous to other biological hierarchical multilabel classification problems, such as gene function prediction, for which the best reported performance in mouse was a precision of 41% at a recall of 20% (L. Pena-Castillo et al., Genome Biol, 2008).
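Precision and recall for a single query can be computed from the predicted and the reference annotation sets. The Python sketch below shows the definitions only; the concept sets are toy examples, and how the talk aggregates these measures across datasets in the leave-one-out evaluation is not specified here.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of a predicted concept set against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Toy example: four predicted concepts, three reference concepts, two shared.
predicted_concepts = {"Myopathy", "Muscular Dystrophies", "Asthma", "Glaucoma"}
reference_concepts = {"Myopathy", "Muscular Dystrophies", "Neuromuscular Diseases"}
precision, recall = precision_recall(predicted_concepts, reference_concepts)
```

In this toy case two of the four predictions are correct and two of the three reference concepts are recovered.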

Reduce the number of datasets

Further accumulation of datasets would increase the power of our method.

Discussions

• Power of this type of approach will increase daily with the continuous and rapid accumulation of genomics data in the public repositories.

• Our diagnosis system is also promising in its potential to reveal unexpected disease connections, and further to construct novel phenome networks.

• Better tools for text-mining are needed!

Drug-disease connectivity map

UMLS concepts:

Bone Marrow Cells

Dysmyelopoietic Syndromes

Hematopoietic stem cells

Immunoproliferative Disorders

Leukemia, Myelocytic, Acute

Lymphoproliferative Disorders

Myeloid Leukemia

Nonlymphocytic Leukemia, Acute

Stem cells leukemia

The authors’ writing style affects the UMLS annotations.

Ongoing Work I

• To further improve the prediction power of the method (Ci-Ren Jiang)

– (focusing on the second stage) Applying a different approach for more thorough collaborative error correction along the disease hierarchy

– The previous Bayesian Belief Network model is time-consuming and only allows one-way information exchange

One Example

Draft Idea

Some Results

Ongoing Project II

• Focusing on the diagnosis of particular diseases by collaborating with specialized medical doctors/researchers, e.g., comparing bipolar disorder vs schizophrenia (Wayne Lee; Dr. Fei Wang, MD, PhD, Psychiatry, Yale University)

– To investigate the usefulness/effectiveness of high-throughput molecular information in distinguishing between bipolar disorder and schizophrenia

– To compare existing clinical predictors with predictors from microarray data

Acknowledgements

– Dr. X. Jasmine Zhou (Molecular and Computational Biology, USC)

– Dr. Jim Chun-Chih Liu (Molecular and Computational Biology, USC)

– Dr. Ming-Chih Kao (Stanford Hospital, PhD, MD)

– Dr. Ci-Ren Jiang (UCB)

– Wayne Lee (UCB)

– Dr. Fei Wang (Yale U)

Thank You!

Barutcuoglu et al 2006

• Start with independently trained classifiers (hard-margin linear SVMs) for each class without thresholding the outputs

• Assume the aggregate classifier outputs to have Gaussian distributions for positive and negative examples

• Design a Bayesian hierarchical combination scheme to allow collaborative error-correction over all nodes

Clus-HMC (Vens et al 2008)

• Apply predictive clustering tree (PCT) to hierarchical multi-label classification

• The example labels are represented with Boolean components

• Weighted Euclidean distance is used to measure similarities

• The class weights decrease with the depth of the class in the hierarchy
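The distance described in these bullets can be sketched as follows in Python. Label vectors are Boolean (0/1), and each class gets a weight that shrinks with its depth. The exponential form `w0 ** depth`, the default `w0 = 0.75`, and counting depth from 0 at the top level are assumptions made for this illustration; the function names are hypothetical.

```python
import math

def class_weight(depth, w0=0.75):
    """Weight of a class at the given depth; deeper classes count less."""
    return w0 ** depth

def weighted_distance(labels_a, labels_b, depths, w0=0.75):
    """Weighted Euclidean distance between two Boolean label vectors."""
    if not (len(labels_a) == len(labels_b) == len(depths)):
        raise ValueError("label vectors and depths must have equal length")
    return math.sqrt(
        sum(
            class_weight(depth, w0) * (a - b) ** 2
            for a, b, depth in zip(labels_a, labels_b, depths)
        )
    )

# Three classes: one top-level class (depth 0) and two of its children (depth 1).
DEPTHS = [0, 1, 1]
example_a = [1, 1, 0]
example_b = [1, 0, 1]
distance = weighted_distance(example_a, example_b, DEPTHS)
```

The two examples agree on the top-level class and disagree on both children, so only the two depth-1 weights contribute to the distance.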

References

• Zafer Barutcuoglu, Robert E. Schapire, and Olga G. Troyanskaya (2006), “Hierarchical multi-label prediction of gene function,” Bioinformatics 22, 830-836

• Celine Vens, Jan Struyf, Leander Schietgat, Saso Dzeroski, and Hendrik Blockeel (2008), “Decision trees for hierarchical multi-label classification,” Machine Learning 73, 185-214

Background: bioinformatics

Bioinformatics is an interdisciplinary research area which may be defined as the interface between biological and mathematical (i.e., math, statistics, computing) sciences.

In bioinformatics research,

• Biological data are often high dimensional, complex and noisy, e.g.,

– NCBI GEO contains the microarray data produced by thousands of research teams (cross-platform/laboratory variations)

– Some investigations may rely upon a variety of experimental technologies (heterogeneous data types)

• The demands on statisticians are substantial

– precise understanding of the underlying biological principles

– strong analytical, modelling and data manipulation skills
