2006yg-Lecture1 (2) - Department of Computing

Introduction to Bioinformatics
Data analysis of bio-activities for drug discovery
Course 341
Department of Computing
Imperial College, London
Yike Guo
Vasa Curcin
Henry Morris
Recommended Texts
For this part of the course
– Lecture Notes
– Handouts
General overview of microarray data analysis
– “Microarray Gene Expression Data Analysis: A Beginner’s Guide” (Causton, Quackenbush and Brazma)
– “Microarray Bioinformatics” (Stekel)
Data Mining
– “Data Mining: Concepts and Techniques” (Han)
Goal:
Understand the basic bioarray technology, including microarray technology for gene expression, NMR spectroscopy and other high-throughput devices
Learn the basic analytical technology and its applications to bioarray information
Learn the processes for preparing and analysing bioarray data (e.g. gene expression analysis)
Lecture Overview
Lecture One: BioArray Informatics in Drug Discovery
Lecture Two: BioArray Technology
Lecture Three: Analysis Technology (1): Data Normalisation and Transformation
Lecture Four: Analysis Technology (2): Clustering
Lecture Five: Analysis Technology (3): Classification and Ontology
Lecture Six: Integrative Analysis of Gene Expression Data
Lecture Seven: Kernel Method
Lecture Eight: Analysis for NMR Metabonomics Data
The Drug Discovery Process
[Figure: pipeline from database/genes through protein targets and chemical diversity, to identify ‘hit’, optimize ‘hit’ structure, then test safety/efficacy in animals and in humans.]
The aim is to translate new information into new therapies
Complexity of Drug Discovery
Finding a Molecule that Satisfies Multiple Criteria
[Figure: funnel from 10,000 Drug Candidates to 1 Drug Molecule. Labels: Valid Biomedical Hypothesis?, patentable, non-teratogenic.]
[Figure: funnel from 10 Drug Molecules to 1 Drug Launch. Labels: Cost-effective manufacturing, Carcinogenicity studies.]
Bioarray: High Throughput Measurement of Biological Activities
Gene Expression
Protein Expression
SNP
Metabonomic Expression
Chemical Hits
Dynamics in BioArray Informatics
[Figure: interactions among Environment, Metabolites, DNA, RNA and Protein. Other labels: Growth rate, Expression.]
Gene Expression
Cells are different because of differential gene expression.
About 40% of human genes are expressed at one time.
A gene is expressed by transcribing DNA into single-stranded mRNA.
mRNA is later translated into a protein.
Microarrays measure the level of mRNA expression.
Gene Expression Measurement
mRNA expression represents dynamic aspects of the cell
mRNA expression can be measured with the latest technology
mRNA is isolated and labeled with fluorescent protein
mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser
Gene Expression Microarrays
The main types of gene expression microarrays:
Short oligonucleotide arrays (Affymetrix)
cDNA or spotted arrays (Brown/Botstein)
Long oligonucleotide arrays (Agilent Inkjet)
Fiber-optic arrays
...
Affymetrix Microarrays
[Figure: raw image of the array. Labels: 1.28 cm, 50 µm.]
~10^7 oligonucleotides: half Perfectly Match mRNA (PM), half have one Mismatch (MM)
Raw gene expression is the intensity difference: PM - MM
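The PM - MM difference can be illustrated with a minimal sketch. The slide only states that raw expression is the intensity difference; averaging the differences over a gene's probe pairs is an assumption made here, and the function name and intensity values are hypothetical.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene.

    Assumption: one gene is represented by several probe pairs and the
    per-pair differences are simply averaged.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equally sized, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for four probe pairs of a single gene.
pm_intensities = [520.0, 610.0, 480.0, 700.0]
mm_intensities = [310.0, 400.0, 500.0, 350.0]
print(average_difference(pm_intensities, mm_intensities))
```

Note that the result can be negative when the mismatch probes are brighter than the perfect-match probes, which is consistent with negative values appearing in raw expression tables.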
BioArray Informatics: BioArray is the data, everything else is Informatics
Data Engineering
Data Warehousing
Data Integration
Data Analysis
Knowledge Discovery
Discovery Integration
Discovery Validation
Knowledge Integration
Quantitative Analysis
Reproducibility
Confidence intervals to find significant deviations
Data Warehousing
[Figure: data sources feeding the data warehouse. External data sources: KEGG, Unigene, Genbank. Operational data sources: sample & clinical data, BioArray data. Data warehousing: experimental/sample database, expression database, function annotations, structure annotations.]
Data Schema in Warehousing: A Gene Expression Example
[Figure: schema centred on the Gene Expression Warehouse. Entities shown: Disease, Protein, Enzyme, Known Gene, Affy Fragment, Sequence, Sequence Cluster, Metabolite, SNP, Pathway. Sources shown: OMIM, ExPASy SwissProt, PDB, ExPASy Enzyme, LocusLink, MGD, SPAD, NCBI dbSNP, Genbank, NMR, UniGene, KEGG.]
A Workflow of Gene Expression Database
[Figure: data reduction queries against the GXDW. Labels: user defined dataset; warehousing output; comparisons between 2 samples; comparisons between multiple samples; profile report; set fold change (e.g., > 2X); set higher avg difference value (e.g., > 200); A->P / P->A stringency (e.g., 80%); data in analysis; visualisation; advanced gene expression analysis.]
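The fold-change and average-difference thresholds named in this workflow can be sketched as a simple filter over two samples. This is only an illustration under stated assumptions: the function names, the gene labels and the floor used to avoid dividing by non-positive values are all hypothetical, and the A->P / P->A stringency step is not modelled.

```python
def fold_change(a, b, floor=1.0):
    """Symmetric fold change between two expression values.

    Assumption: values are clamped to a small positive floor so that
    negative or zero average differences do not break the ratio.
    """
    a, b = max(a, floor), max(b, floor)
    return max(a, b) / min(a, b)


def reduce_genes(sample_a, sample_b, min_fold=2.0, min_avg_diff=200.0):
    """Keep genes that change by more than min_fold between the two
    samples and exceed min_avg_diff in at least one of them."""
    kept = []
    for gene, a in sample_a.items():
        if gene not in sample_b:
            continue
        b = sample_b[gene]
        if max(a, b) > min_avg_diff and fold_change(a, b) > min_fold:
            kept.append(gene)
    return sorted(kept)


# Hypothetical average-difference values for two samples.
sample_1 = {"g1": 100.0, "g2": 500.0, "g3": 1000.0, "g4": -70.0}
sample_2 = {"g1": 150.0, "g2": 1200.0, "g3": 1100.0, "g4": 30.0}
print(reduce_genes(sample_1, sample_2))
```

Here only g2 survives: g1 and g4 never exceed the intensity threshold, and g3 changes by far less than two-fold.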
Queries, Queries…..

Query to the data
– Which genes are linked ?
– Which genes are expressed similarly to my gene XYZ?
– Which genes are co-expressed in differing conditions ?
– classification (of tumors, diseased tissues etc.): which
patterns are characteristic for a certain class of samples,
which genes are involved?
– functional classification of genes: Are changes clustered in
particular classes?
– metabolic pathway information: Is a certain pathway/route in
a pathway affected?
– disease information & clinical follow up: correlation to
expression patterns.
– phenotype information for mutants: Are there correlations
between particular phenotypes and expression patterns?
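A query such as "which genes are expressed similarly to my gene XYZ?" is often approached by ranking genes by the correlation of their expression profiles. The sketch below uses Pearson correlation, which is one possible similarity measure and not necessarily the one used in the course; the function names and the profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def most_similar(query, profiles, top=3):
    """Return the `top` genes most correlated with `query`, best first."""
    scored = [(pearson(profiles[query], values), gene)
              for gene, values in profiles.items() if gene != query]
    scored.sort(reverse=True)
    return [(gene, r) for r, gene in scored[:top]]


# Hypothetical expression profiles over four conditions.
profiles = {
    "XYZ": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 2.0, 4.0],
}
for gene, r in most_similar("XYZ", profiles, top=2):
    print(gene, round(r, 3))
```

Sorting in ascending order instead would surface anti-correlated genes such as geneB, which can be just as interesting biologically.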
Gene Expression Data Analysis Work Flow
[Figure: data in analysis feed interactive analysis procedures, leading to knowledge deliverables. Interactive analysis procedures: cluster by genes, study outliers, correlate clinical measurements, literature analysis, time course analysis. Defined subsets of genes (examples, not exhaustive): classic drug targets, known disease association, cross species indices.]
Microarray Data Analysis Types
Gene Selection
– find genes for therapeutic targets
– avoid false positives (FDA approval ?)
Classification (Supervised)
– identify disease
– predict outcome / select best treatment
Clustering (Unsupervised)
– find new biological classes / refine existing ones
– exploration
…
Microarray Data Mining Challenges
too few records (samples), usually < 100
too many columns (genes), usually > 1,000
too many columns are likely to lead to false positives
for exploration, a large set of all relevant genes is desired
for diagnostics or identification of therapeutic targets, the smallest set of genes is needed
model needs to be explainable to biologists
Classification
desired features:
– robust in presence of false positives
– understandable
– return confidence/probability
– fast enough
simplest approaches are most robust
advanced approaches can be more accurate
Microarray Data Classification
[Figure: microarray chips -> images scanned by laser -> datasets -> data mining model. A new sample is passed to the model, which returns a prediction: ALL or AML.]

Gene              Value
D26528_at         193
D26561_cds1_at    -70
D26561_cds2_at    144
D26561_cds3_at    33
D26579_at         318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at         707

Class  Sno  D26528  D63874  D63880  …
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
…
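To make the "new sample in, ALL or AML out" step concrete, here is a minimal k-nearest-neighbour sketch over the seven training rows shown in the table. It is not the model used in the lecture: the choice of k-NN, Euclidean distance on raw values, k = 3 and both new samples are assumptions for illustration only.

```python
import math
from collections import Counter

# Training rows from the slide: (class, D26528, D63874, D63880).
TRAINING = [
    ("ALL", 193, 4157, 556),
    ("ALL", 129, 11557, 476),
    ("ALL", 44, 12125, 498),
    ("ALL", 218, 8484, 1211),
    ("AML", 109, 3537, 131),
    ("AML", 106, 4578, 94),
    ("AML", 211, 2431, 209),
]


def knn_predict(sample, training=TRAINING, k=3):
    """Return (predicted class, fraction of the k neighbours that agree)."""
    by_distance = sorted(
        (math.dist(sample, row[1:]), row[0]) for row in training
    )
    votes = Counter(label for _, label in by_distance[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


# Hypothetical new samples (values invented for illustration).
print(knn_predict((200, 9000, 600)))
print(knn_predict((150, 3000, 150)))
```

The vote fraction is a crude stand-in for the "return confidence/probability" requirement. Because the raw values of D63874 are much larger than those of the other genes, that gene dominates the distance; in practice the data would be normalised or transformed first.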
FALSE POSITIVES PROBLEM
Not enough records (samples), usually < 100
Too many columns (genes), usually >> 1,000
FALSE POSITIVES are very likely because of few records and many columns
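Why few records and many columns produce false positives can be seen in a small simulation: generate genes that are pure noise, split the samples into two arbitrary classes, and count how many genes still look "different". Everything here is an assumption for illustration: the sizes, the Gaussian noise, the Welch t statistic and the cutoff of 2.03 (roughly the two-sided 5% critical value for about 36 degrees of freedom).

```python
import math
import random


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)


def count_false_positives(n_genes=1000, n_per_class=19, t_cutoff=2.03, seed=0):
    """Count noise-only genes whose |t| exceeds the cutoff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        class_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        class_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
        if abs(welch_t(class_a, class_b)) > t_cutoff:
            hits += 1
    return hits


print(count_false_positives())
```

Under these assumptions roughly 5% of the noise genes pass the cutoff, so the number of spurious "discoveries" grows in proportion to the number of genes tested.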
Popular Classification Methods
Decision Trees/Rules
– find smallest gene sets, but not robust to false positives
Neural Nets - work well for reduced # of genes
K-nearest neighbor - robust for small # of genes
TreeNet from authors of CART and MARS
– networks of simple trees; very robust against outliers
Support Vector Machines (SVM)
– good accuracy, does its own gene selection, but hard to understand
...
Microarrays: An Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al, Science, v.286, 1999
– 72 examples (38 train, 34 test), about 7,000 genes
– well-studied (CAMDA-2000), good test example
[Figure: ALL and AML samples.]
– Visually similar, but genetically very different
Results on the test data
Genes selected and model trained on Train set ONLY!
Best Clementine neural net model used 10 genes per class
Evaluation on test data (34 samples) gives
– 1 or 2 errors (94-97% accuracy)
– Note: all methods give error on sample 66, believed to be mis-classified by a pathologist
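The 94-97% range follows directly from 1 or 2 errors on 34 test samples. A short worked check (the function name is hypothetical):

```python
def accuracy(n_samples, n_errors):
    """Fraction of correctly classified samples."""
    if n_samples <= 0 or not 0 <= n_errors <= n_samples:
        raise ValueError("invalid counts")
    return (n_samples - n_errors) / n_samples


for errors in (1, 2):
    print(f"{errors} error(s): {accuracy(34, errors) * 100:.1f}%")
# prints "1 error(s): 97.1%" and "2 error(s): 94.1%"
```

With only 34 test samples, a single sample shifts accuracy by about three percentage points, which is worth keeping in mind when comparing methods.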
Clustering Goals
Find natural classes in the data
Identify new classes / gene correlations
Refine existing taxonomies
Support biological analysis / discovery
Different Methods
– Hierarchical clustering, SOMs, etc.
Yeast SOM Clusters

Yeast Cell Cycle SOM.
www.pnas.org/cgi/content/full/96/6/2907

(a) 6 × 5 SOM. The 828 genes that passed the variation filter were grouped into 30
clusters. Each cluster is represented by the centroid (average pattern) for genes in
the cluster. Expression level of each gene was normalized to have mean = 0 and
SD = 1 across time points. Expression levels are shown on y-axis and time points
on x-axis. Error bars indicate the SD of average expression. n indicates the
number of genes within each cluster. Note that multiple clusters exhibit periodic
behavior and that adjacent clusters have similar behavior. (b) Cluster 29 detail.
Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in
late G1. Normalized expression pattern of 30 genes nearest the centroid are
shown. (c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to
G1, S, G2 and M phases of the cell cycle, are shown.
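The per-gene normalisation described in the caption (mean = 0 and SD = 1 across time points) can be sketched as follows. The caption does not say whether the population or the sample standard deviation was used; the population form is assumed here, and the function name and example profile are hypothetical.

```python
import statistics


def normalize_profile(values):
    """Rescale one gene's time course to mean 0 and SD 1.

    Assumption: population standard deviation (divide by n).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        raise ValueError("a flat profile cannot be normalised")
    return [(v - mean) / sd for v in values]


# Hypothetical expression levels of one gene at five time points.
profile = [120.0, 340.0, 560.0, 300.0, 130.0]
z = normalize_profile(profile)
print([round(v, 3) for v in z])
```

After this step, genes are compared by the shape of their time course and not by their absolute expression level, which is why clusters can be summarised by a centroid pattern.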
BioArray Informatics: Data Analysis of Bioarray Data within the Biological Context
[Figure: the kinds of biological information surrounding bioarray data: secondary structure, tertiary structure, polymorphism, patient records, epidemiology, expression patterns, physiology, sequences, alignments (e.g. ATGCAAGTCCCT, AAGATTGCATAA, GCTCGCTCAGTT), receptors, signals, pathways, linkage maps, cytogenetic maps, physical maps.]
An illustration of iterative analysis of Bio-activities
[Figure: graph with nodes Gene, Protein, Receptor and Ligand; edges are numbered 1 to 9 as listed under Relations.]
Relations
1 - gene homologs
2 - gene encodes a protein
3 - protein can regulate the expression of a gene
4 - protein phosphorylates another protein
5 - protein binds to another protein
6 - protein lyses another protein
7 - proteins can sometimes be receptors
8 - receptors bind a ligand
9 - receptors (if bound) activate other proteins
Advanced Analysis
Discovery Annotation and Validation
– e.g. Annotating a set of co-expressed genes with some conserved regulatory motifs
– e.g. Scoring a co-expression pattern with pathways
– e.g. Literature analysis to annotate biological meaning
Integrative Analysis
– e.g. Multi-modality Analysis
– e.g. Cross Annotation of Discovered Patterns
Modelling and Simulation
– e.g. Pathway Synthesis
– e.g. Virtual Cell Modelling
Multi-Modality Analysis
[Figure: “real world” inputs (noxious agent/stressor) and outputs (“biological end-points”: pathology, altered physiology and metabolism) set against the “-omics world” of gene, protein and metabolic profiles over time. A mathematical model forwards-propagates correlations from an event through mRNA, protein and metabolites over time.]
Integrated Analysis of Metabonomic (plasma) versus Muscle Gene Expression Data for Insulin Resistance (Prof. Jeremy Nicholson et al.)
[Figure: integration of RNA and NMR results. Annotations: zones showing high gene-metabolite correlations; strongly weighting variables WT > KO; minimally weighting variables; strongly weighting variables KO > WT.]
Red = mRNA expression levels for each gene/EST
Black = individual quantitative single pulse plasma NMR spectral descriptors
Discovery of causal processes
A long term goal of Systems Biology is to discover the causal processes among genes, proteins, and other molecules in cells
Can this be done (in part) by using data from High Throughput experiments, such as microarrays?
Bayesian Causal Network Structure
P(GAL4)
P(GAL2 | GAL4)
P(Intracellular Galactose | GAL2)
Each variable is independent of
its distant causes given all of its
direct causes.
Thanks to Greg Cooper, U. Pitt
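The three factors on this slide define a joint distribution over the chain GAL4 -> GAL2 -> intracellular galactose. The sketch below multiplies them and checks the stated independence property numerically. All probability values are invented for illustration, and each variable is reduced to a hypothetical on/off state.

```python
# Hypothetical conditional probability tables (True = "high", False = "low").
P_GAL4 = {True: 0.3, False: 0.7}
P_GAL2_GIVEN_GAL4 = {
    True: {True: 0.9, False: 0.1},
    False: {True: 0.2, False: 0.8},
}
P_GALACTOSE_GIVEN_GAL2 = {
    True: {True: 0.8, False: 0.2},
    False: {True: 0.1, False: 0.9},
}


def joint(gal4, gal2, galactose):
    """P(GAL4) * P(GAL2 | GAL4) * P(Intracellular Galactose | GAL2)."""
    return (P_GAL4[gal4]
            * P_GAL2_GIVEN_GAL4[gal4][gal2]
            * P_GALACTOSE_GIVEN_GAL2[gal2][galactose])


def galactose_given(galactose, gal2, gal4):
    """P(galactose | GAL2, GAL4), computed from the joint distribution."""
    total = sum(joint(gal4, gal2, g) for g in (True, False))
    return joint(gal4, gal2, galactose) / total


# Given its direct cause GAL2, galactose does not depend on GAL4.
print(galactose_given(True, gal2=True, gal4=True))
print(galactose_given(True, gal2=True, gal4=False))
```

Both printed values equal P(Intracellular Galactose | GAL2 high), which is the statement that a variable is independent of its distant causes given all of its direct causes.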
Bayesian Network Learned for
Yeast
Hartemink et al, Combining Location and Expression Data for
Principled Discovery of Genetic Regulatory Network Models,
PSB 2002 psb.stanford.edu/psb-online
Integrate biological knowledge when analyzing
microarray data (from Cheng Li, Harvard SPH)
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25
Enjoy the lecture so you can find a
drug (or many jobs)