Computational analysis of genome- wide expression data Paul Pavlidis Columbia Genome Center

advertisement
Computational analysis of genomewide expression data
Paul Pavlidis
Columbia Genome Center
pp175@columbia.edu
Lecture overview
1. Microarray technology and applications
2. How the data is collected and what you get.
3. “High level” analysis methods: applied to the
study of human sarcoma.
1. Supervised and unsupervised learning.
2. Feature selection.
4. Method for applying biological prior knowledge.
5. (If there is time) Further applications of the
technology.
Review: gene expression
• DNA  pre-mRNA  mRNA  protein
• Many potential steps for regulation.
• Many genes are differentially transcribed according to:
• tissues
• cell types
• in various disease, physiological, and developmental states.
• mRNA is easy to quantify using hybridization assays.
• protein levels harder to measure in a high-throughput
assay.
Microarrays
• Thousands of small (20-200m) spots of DNA
probes on a glass slide.
• Use to measure gene expression (RNA) levels of
10,000-20,000+ genes in parallel.
• Old way: Northern blots etc. let you measure one gene
at a time.
• Generally give only relative expression
information.
• Methods exist for getting absolute measurements
(SAGE, calibrated arrays)
• Yields a type of snapshot of the molecular state
of the sample.
Applications for microarrays
• Diagnosis: molecular ‘portraits’ of disease
• Preventative medicine - early detection.
• Personalized treatment.
• Refined diagnosis and prognosis.
• Disease/phenotype characterization: What genes are
affected by condition X? (esp. complex traits)
• Gene expression regulation network elucidation: what
happens to the expression of gene Y if you knock out
gene X?
• Mutation and polymorphism detection
• Genome analysis: gene finding/structure determination
• Sets a model for high-throughput technologies:
• Protein arrays
• Post-translational modification assay arrays
• Protein interaction arrays
Why microarrays are of interest to
computational biologists
• Can generate a lot of data very quickly
(compared to what biologists usually deal with).
• Many studies include 50 samples or more =
~1,000,000 data points.
• Messier than sequence data (in interesting ways)
• Continuous rather than discrete
• Many more sources of variability, including biological.
• Can ask questions you can’t ask of sequence
data.
Microarray technology
Oligo (one-channel)
Spotted (two-color)
Doing a microarray experiment
Common experimental designs
• Compare an experimental condition to a control
condition (hopefully with replication).
• mutant vs. wild type.
• diseased vs. normal.
• Assemble a compendium of sample/conditions or
time points.
• Stages of the cell cycle.
• Different tumor samples.
Computation and microarrays I: Data
acquisition and preprocessing
•
•
•
•
Scan of fluorescent-labeled array to get an initial image
 Spot-finding
 Signal and background determination
 Calculation of expression measure for each spot
• (ratiometric array)  expression ratio
• (Affymetrix array)  ‘signal’ – combine signals from multiple probes for a
gene.
•  Normalization (correct for non-biological systematic errors)
•  Expression data matrix
• All of these steps present
statistical/analytical/computational problems.
Values for genes on one array
• Two color arrays: ratio of “red” to “green” intensities.
Usually expressed as log(R/G).
• One-color arrays (Affymetrix): “signal” – just a relative
measure of expression of the gene.
• Either way we have a number that represents the
expression level of the gene.
• To keep things simple, for two-color arrays a common
reference sample R is often used for all arrays.
A/R B/R C/R D/R E/R etc.
• Equivalent for one color arrays:
A
B
C
D
E
False color images of spotted array
•
•
•
•
•
•
Overlay of two scans of the slide
Compares the two samples
Green = less relative expression
Red = more relative expresion
Yellow = equal expression
Dimmer colors = lower expression levels.
Normalizing two-color arrays
• Due to imbalances in dye labeling, the signals for the two colors are
rarely “balanced”.
• There are many other sources of non-biological systematic error.
• Normalization attempts to correct for this.
• More complicated than it sounds because of multiple error sources.
• All types of arrays have to be normalized before array-array
comparisons are possible
before
after
Combined data for multiple arrays:
Expression data matrix for an
experiment
•
The matrix entry at (i, j) is the expression level of gene i in
experiment j.
Computation and microarrays II: Some
‘High level’ analysis methods
• Clustering (of genes and/or samples).
• Statistical analysis to identify ‘changed’ genes or correlate
genes with experimental factors.
• Analysis of variance
• Regression
• Supervised learning of sample or gene categories.
• i.e., support vector machine, k-nearest neighbor, etc.
• Data mining and statistical methods such as:
• Principal components/singular value decomposition
• Multidimensional scaling
• Visualization
• Prediction of gene-gene interactions (pathways/networks)
An extended example:
Microarray Analysis of Human Sarcoma
Collaboration with
Memorial Sloan Kettering Cancer Center
What is sarcoma?
• Sarcoma = “fleshy growth”
• Tumors of connective tissues, nerves, fat,
muscle, etc. i.e. mesoderm, neuroectoderm.
• 1% of new cancer cases.
• Many cases (30%?) are referred to MSKCC
• Other major types of cancer:
• carcinoma, which come from epithelial cell lineages.
• leukemia and lymphoma (blood and lymph derived)
Nine types of sarcoma studied as delinieated by histology and (*) molecular markers
•
•
•
•
•
•
•
•
•
•
liposarcoma (lipo) - fat
dedifferentiated liposarcoma (lipodediff)
pleomorphic liposarcoma (lipopleo)
round cell liposarcoma (roundcell)*
fibrosarcoma (fibro) - fibroblast
leiomyosarcoma (leio) - smooth muscle
malignant fibrous histiocytoma (MFH) - pleomorphic
synovial sarcoma (synovial) - joint-cell like*
gastrointestinal stromal tumor (GIST) * - c-kit: Gleevec
clear cell sarcoma (clearcell) - pigmented *
Are these types distinguishable at the
level of RNA expression?
Stripped-down version of the approach
• Affymetrix genechips (12500 genes/ESTs) run on
RNA from each sample.
• Final data set has 52 samples.
• Clustering - Which types of tumor cluster
together well?
• Feature selection + SVM - Which types are
learnable? Which samples are misclassified?
What genes distinguish each class?
Clustering
• Unsupervised learning
• Goal: identify groupings
in the data.
• Based on some notion
of similarity of the
profiles being clustered
•
•
•
•
Euclidean distance
Correlation
Manhattan distance
many others…
• Algorithms:
• Hierarchical
• k-means
• self-organizing maps
Clustering genes
• Hypothesis: genes with related functions will
cluster together (are “coexpressed”).
• True for some classes of genes, but not generally
true.
• “Function” is too broad a term, and we don’t always
expect genes with related functions to be
coexpressed.
• Seems to apply most often to ‘housekeeping’
functions.
• Probably the most overapplied method in microarray
analysis.
• Genes which are coexpressed are potentially
coregulated, but not generally.
Clustering samples
• When the samples are ‘pseudoreplicates’ (i.e,
samples from individual patients), clustering may
reveal known or unknown classes.
• In the context of the sarcoma data, we can see:
• If histologically defined classes cluster on the basis of
gene expression.
• Which classes are most similar to other classes.
• If there are previously unrecognized subtypes within a
histologically defined class.
Sarcoma clustering results
The alternative to clustering:
Supervised learning
Training set
Genes
Experiments
Learner
Model
Class membership
Genes
Experiments
Test set (“unknowns”)
Predictor
Predicted Class
With supervised learning, can ask:
• Which classes are recognizable as such
(“learnable”).
• When classification errors are made, are they
telling us something about the label on the test
sample?
• We can identify ‘mislabeled samples’ this way.
• Potentially allow us to make diagnoses and
predictions for new samples.
• Obviously we need to have classes defined first.
Support vector machines
+
+
+
+ +
+
+
+ + + - - +
-++ +
-
-
Locate a
plane that
separates
positive from
negative
examples.
-
-
-
Focus on the examples closest to the boundary.
Feature selection
• For tissue classification, each gene is a feature.
• Most genes are not informative about the classes we are
learning – they aren’t relevant.
• Many genes are not evenexpressed in the tissue
assayed.
• Including too many ‘noisy’ features degrades learning
performance.
• It is common to attempt to identify features that will be
most useful for learning.
• Those features (genes) are also the ones which
are most associated with the particular tumor
type.
Selecting genes with a t-test
m1  m 2
v v

n1 n2
μi = mean expression value in class i
ni = number of examples in class i
v = pooled variance across both classes
Other methods exist:
Analysis of variance, t-test variants, non-parametric methods, etc.
Features selected: GIST (an easy class)
• Light colors = higher expression
• Student’s t-test
• In decreasing order of p-value
Probe
31792_at
39591_s_at
40095_at
1888_s_at
39593_at
37221_at
35871_s_at
2081_s_at
37669_s_at
39592_r_at
35162_s_at
32351_at
38434_at
38949_at
36918_at
41325_at
38291_at
36394_at
41417_at
35285_at
ttest
37.1
33.1
31.3
30.7
29
22.9
22.4
21.8
21.3
20.6
20.6
20.5
20.2
19.6
19.5
19.2
19.1
18.8
18.7
18.6
Genbank
M20560
Z36531
J03037
X06182
AI432401
M31158
AF011390
L07032
U16799
Z36531
D31770
U66579
M95627
L01087
Y15723
AF006823
J00123
AB012293
AC003108
AF007216
Description
lipocortin-III
fibrinogen-like protein
carbonic anhydrase II
c-kit
EST
cAMP-dependent protein kinase subunit RI
pancreas sodium bicarbonate cotransporter
protein kinase C theta (PKC)
Na,K-AT Pase beta-1
fibrinogen-like protein
osteosarcoma mRNA for activin typeII A r
putative G protein-coupled receptor (GPR2
angio-associated migratory cell protein (AA
protein kinase C-theta
soluble guanylyl cyclase
T WIK-related acid-sensitive K+ channel (T
enkephalin
LY6H Mrna
BAC
sodium bicarbonate cotransporter (HNBC1
Features selected: MFH (a harder class)
Welch’s t-test
Student’s t-test
Fisher’s disc.
Hold-one-out cross-validation of
SVM with feature selection
1...52
1: hold out one sample
2: select features
22, 4,..,13
3: train SVM
4: classify held-out sample
Dedifferentiated Lipo.2
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
false positive ()
True positive (“refined” classes”) ()
false positive (“refined” classes”) ()
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
2
3
4
5
6
7
8
9
10
11
12
13
Number of occurrences
True positive ()
0 2 4 6 8 1012
0 2 4 6 8 1012
Leiom yosarcom a2
2
3
4
5
6
7
8
9
10
11
12
13
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
Dedifferentiated Lipo.
Pleom orphic Lipo.
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
MFH2
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
Fibrosarcom a
CCS
Round Cell Lipo.
log2Number of features
2
3
4
5
6
7
8
9
10
11
12
13
MFH
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
2
3
4
5
6
7
8
9
10
11
12
13
0 2 4 6 8 1012
0 2 4 6 8 1012
0 2 4 6 8 1012
Leiom yosarcom a
2
3
4
5
6
7
8
9
10
11
12
13
Synovial Sarcom a
GIST
0 2 4 6 8 1012
2
3
4
5
6
7
8
9
10
11
12
13
What about all those genes we
selected?
• What can we say about the genes which
distinguish particular classes?
• They are presumably telling us something about
the biology of the tumors:
• Hints about causes?
• Drug targets?
• Provide markers for diagnosis or targeting drugs?
• An expert can look at the results and come up
with a story about each gene, for each class.
• Can we do this computationally?
Making use of biological prior
knowledge
• There is a huge biological knowledge base - we
know a lot about many genes.
• How can we layer this information on the gene
expression data?
• Would like to make seamless, optimal use of:
•
•
•
•
Gene function knowledge
Genetic mapping information
Sequence data – from other species too.
etc., etc., etc.
Relating expression to gene function
• Given a microarray data set, often want to ask:
Are there any functional commonalities among
the genes which were affected?
• Typical approach:
• “This cluster contains a lot of ribosomal protein genes”.
• “If you look at the citric acid cycle genes, a lot of them changed
during the experiment”.
• More efficient, structured alternative: Class scoring.
Class scoring: basic idea
Given: Expression data and functional annotations (class labels) for the genes.
Task: Find the interesting gene classes.
Solution: Give each class a score.
• “Class” is any biologically
meaningful set of genes.
• “Semi-supervised”
• Scores can be generated
in multiple ways.
Sources of annotations
• Gene Ontology: a controlled vocabulary for describing
gene function
• http://www.geneontology.org
• Most of the major species genome databases are adopting it for
annotation.
• MIPS catalog of yeast genes (likely to be superceded by
GO)
• http://mips.gsf.de/
• Both are hierarchys of terms.
• Unfortunately, currently not all genes are annotated well
(or at all).
GO example
(Browser at http://www.godatabase.org/cgi-bin/go.cgi)
Two gene class scoring methods
What makes a gene class “interesting”?
• Similarity of the expression profiles.
• Correlation score: Are expression profiles of the genes
in the class similar? (Akin to clustering)
• Significant effects of experimental treatments.
• Experiment score: Do genes in the class have good
group-comparison statistics (ttest, ANOVA, etc.) ?
Correlation score: details
1.
2.
3.
4.
5.
6.
Get the data for one class.
Measure the correlations between genes in the class. (Pearson correlation coefficient)
Take the average of the correlations as the score for the class.
Calculate p-value for the class
Repeat steps 1-4 for many classes.
Classes with best p-values are ‘most interesting’.
Data
Class
data
n*(n-1)/2
pairwise
correlations
average
Correlation score: example
•
Yeast data
• Fermentation, class correlation 0.46, p < 10-5
• Morphogenesis, class correlation ~0, p ~ 1
Experiment score: details
1.
2.
3.
4.
5.
6.
data
Perform an analysis of variance, t-test, or other appropriate statistical test on each gene in
the data set, yielding a score (p-value) for each gene.
Get the statistics data for a class.
The average of the -log p-values for the genes in the class is the score for the class.
Convert the raw score into a pvalue.
Repeat 2-4 for many classes.
Classes with best pvalues are ‘Most interesting’.
gene p-values
stats
for class
test each gene
average
Experiment score: example
•
One-way ANOVA performed on data from three leukemia types (Data from Golub et al.)
T-cell receptor: ave -log (pvalue) = 4.6, p<10-5
ALL-B-cell
ALL-T-cell
AML
Transferases: ave -log (pvalue) = ~ 1.5, p ~1
ALL-B-cell
ALL-T-cell
AML
Converting raw scores into p values
How likely are we to get a class with a given score by chance?
• The distribution of scores
is affected by the class
size.
• Use empirical
Class size 30
measurement of score
distributions generated by
random samples of the
data.
10000
9000
8000
Counts
7000
6000
5000
4000
3000
Class size 2
2000
1000
0
0
1
2
3
Class score
4
5
6
Correlation and experiment scores are
largely complementary
Mouse brain region data (Sandberg et al.)
Correlation score method tends to
select ‘housekeeping’ classes
• Across different experimental designs,
organisms, and tissues:
•
•
•
•
Ribosomal proteins/protein synthesis
Mitochondrial energy production/TCA cycle
RNA processing/spliceosome
Protein degradation/proteasome
• Suggests that:
• These genes are always tightly coregulated.
• Results of correlation score are not specific to the
experiment at hand – “biological noise”.
Experiment score tends to select
classes that are relevant to the
specific experimental situation
• T-cell receptors, immune system for leukemia.
• Synaptic transmission, myelination, ion channels for brain
region data.
• True for the yeast data but plenty of overlap with the
correlation score.
• This is observed in additional data sets we have
analyzed: Cancer, obesity, human brain, etc.
• Additional “unexpected” classes are also identified in
many experiments.
Class scoring on the refined MFH class
Class scoring for GIST
Issues for the future
• Using biological prior knowledge to make the most of
microarray data.
• How to combine and interrelate various genome-wide
data types, including microarrays.
• Extracting information about genetic networks from the
data.
• Need more data!
• Can microarray data be used the way sequence data is
used, or is it too messy?
• Comparing data between labs, platforms, and organisms.
• How will protein arrays and other developing technologies
fit in?
A few resources
• Stanford Microarray Database
• http://genome-www5.stanford.edu/MicroArray/SMD/
• Whitehead Institute: Cancer Genome Research
• http://www-genome.wi.mit.edu/cancer/
• NCBI gene expression omnibus
• http://www.ncbi.nlm.nih.gov/geo/
• Some software is available from my website:
• http://rbp1sun.cpmc.columbia.edu/
• Gene Ontology:
• http://www.geneontology.org
Further applications and methods:
examples from the literature
• Determination of gene structure
• Finding genetic regulatory pathways
Using arrays to determine gene
structure
• Computational approaches are currently
inadequate.
• First/last exons often incorrectly predicted.
• Alternate splicing very difficult to detect.
• Experimental approaches can be very effective,
but the genome is very large...
• One answer: genome-scanning and tiling arrays.
see Shoemaker, et al., Nature 2001 (genome issue)
Genome tiling arrays
Exon assay arrays
• Arrays contain probes for known and predicted exons.
• Assays run on 60 different human tissues.
• Can be used to determine the accuracy of computational
predictions.
• Can detect alternative splicing.
Finding pathways (or pieces of them)
• Popular data set: Mutations in 300 known yeast
genes (not all of known function)
• Microarray analysis of each of the 300 strains
• We can ask, which genes are affected by
knocking out gene X?
• One example: Pe’er et al., ISMB 2001 (using
data from Hughes et al., Cell 2000)
Overview of data
Example of a subnetwork found
A few resources
• Stanford Microarray Database
• http://genome-www5.stanford.edu/MicroArray/SMD/
• Whitehead Institute: Cancer Genome Research
• http://www-genome.wi.mit.edu/cancer/
• NCBI gene expression omnibus
• http://www.ncbi.nlm.nih.gov/geo/
• Some software is available from my website:
• http://rbp1sun.cpmc.columbia.edu/
• Gene Ontology:
• http://www.geneontology.org
Download