Gene expression studies of cancer:
gene transcription signatures
Chad Creighton
January 2011
Oncogenic signaling pathways in cancer
Mutation/deregulation
of a handful of genes
can make cells into
cancer cells.
Hanahan and Weinberg. Cell. 2000 100:57-70
Widespread deregulation of gene
expression in cancer
• Gene expression profiling distinguishes prostate cancer from normal prostate and from BPH.
Dhanasekaran et al. Nature. 2001 Aug 23;412(6849):822-6.
Widespread deregulation of gene
expression in cancer
• Gene expression profiling identifies different subtypes of breast cancer.
Sorlie et al. PNAS. 2003 100(14):8418-23
A gene-expression signature as a
predictor of survival in breast cancer
www.agendia.com
Van de Vijver et al. NEJM 2002
347(25):1999-2009.
Oncogenic pathway signatures in human
cancers as a guide to targeted therapies
Bild et al. Nature. 2006
439(7074):353-7.
• Use oncogenic signatures to predict response of cell lines to targeted therapy.
Oncogenic signatures of ERBB2, EGFR,
MEK, RAF, and MAPK in breast cancer cells
Creighton et al. Cancer
Res. 2006 66(7):3903-11.
Preliminary gene expression
profiling studies of cancer
• Hundreds of genes are deregulated in
cancer.
• Different subtypes of cancer are defined
by gene expression profiling.
• Gene expression signatures may predict
cancer patient survival.
• Gene expression signatures of oncogenic
signaling pathways can be defined using
experimental models (cell lines, mice).
Potential uses for gene expression
profiling of cancer
• Define and understand the molecular
pathways that underlie cancer.
• Define subgroups of patients for the
purposes of optimizing treatment.
– Determine whether or not a patient would
benefit from a given therapy (e.g.
chemotherapy).
– Determine what specific pathways are
deregulated in the tumor and treat the tumor
with therapies that target that pathway (e.g.
hormone therapy for ER+ breast cancer).
General concepts of gene
expression analysis
• Low level analysis
– Processing image files
– Normalization
– Quality Control (QC)
• High level analysis
– Clustering
– Selecting differentially expressed genes
– Enrichment analysis or “Meta-analysis”
Publicly available gene expression
profile data represents a rich
resource
• When publishing studies using gene
expression profile data, authors are
encouraged to make the data available to
everyone.
• Subsequent studies can re-analyze the
data with different questions in mind from
what the original authors had.
• The GEO database (http://www.ncbi.nlm.nih.gov/geo/) makes thousands of expression profile datasets publicly available.
• Many top journals require microarray studies to make data public on GEO.
Pathway-related gene sets:
Gene Ontology (GO) terms
• The Gene Ontology project provides a controlled
vocabulary to describe gene attributes.
• Three major categories:
– Cellular component
– Biological process
– Molecular function
• The controlled vocabularies are structured so
that they can be queried at different levels:
– For example, use GO to find all gene products
involved in ‘signal transduction’, or zoom in on all
‘receptor tyrosine kinases’.
www.geneontology.org
Pathway-related gene sets:
Molecular Signature Database
(mSigDB)
• From the Broad Institute
• Collection of gene sets curated from the
literature (including gene expression
profiling studies).
• Current version represents over 1800
pathway-associated gene sets
http://www.broad.mit.edu/gsea/msigdb/index.jsp
Gene “signatures”
• Will be loosely defined here to mean a set
of genes that are functionally associated
with each other in some way.
• Ways to define gene signatures:
– Gene annotation (e.g. Gene Ontology terms)
– Curated pathway-associated gene sets
– Literature review articles
– “Gene expression signature”: a gene signature
defined using expression profiling data
• e.g. what genes go up or down in response to
treatment in an experimental model
Gene expression signatures
• When using expression profiling to define genes,
a gene expression signature consists of two
things:
– A set of genes going “up” (relative to something).
– A set of genes going “down” (relative to something).
• Relative direction of the genes (up-regulated vs
down-regulated, or over-expressed vs under-expressed) is important.
• Keep the “up” genes separated from the “down”
genes.
How do we relate gene expression
profile results from different
datasets to each other?
The enrichment problem
• A: Given a gene set or sets of interest.
– i.e. a “gene signature”
• B: Given an independent expression dataset
with the profiled genes being ranked by a
specified metric.
– e.g. “cancer vs. normal” or “correlation with MYC.”
• Are the genes in (A) enriched within (B)?
– i.e. do the results of (A) and (B) overlap
significantly?
Methods for determining enrichment
• Venn diagram, or “marble jar” approach
– Take the top set of genes from the expression dataset
(dataset B), tabulate the amount of overlap with the
independent gene set of interest (gene set A).
• Rank-based approach
– Use the entire dataset, including genes of borderline
significance or showing a weak trend towards
significance.
• Correlation approach
– For a set of genes, compute correlation between two
sets of weighting factors (based on different profiling
datasets).
Venn diagram enrichment analysis
1. Define gene set of interest
2. Define differentially expressed genes
3. Determine overlap between the two gene sets
• Requires us to make a “cut” to define what the
top genes are.
• Significance of overlap may be determined by
chi-square or one-sided Fisher’s exact tests.
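The Venn diagram approach can be sketched in R as follows. This is only an illustration: the gene universe, the signature, and the list of top genes are made-up placeholders, and the "cut" that defines the top genes is assumed to have been made already.

```r
# Hypothetical gene universe and gene sets, for illustration only
universe  <- paste0("gene", 1:1000)        # all genes profiled
signature <- universe[1:50]                # gene set of interest
top_genes <- universe[c(31:50, 501:580)]   # top genes after making a "cut"

# Overlap between the two gene sets
overlap <- intersect(signature, top_genes)

# 2 x 2 table: membership in the signature by membership in the top list
in_signature <- universe %in% signature
in_top_list  <- universe %in% top_genes
tab <- table(in_signature, in_top_list)

# One-sided Fisher's exact test: is the overlap larger than expected by chance?
ft <- fisher.test(tab, alternative = "greater")

length(overlap)                                          # observed overlap
length(top_genes) * length(signature) / length(universe) # overlap expected by chance
ft$p.value
```

The one-sided alternative matters here: a two-sided test would also count unusually small overlaps as significant.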
Hypergeometric formula (one-sided
Fisher’s exact test)
• Number of genes in total population: G
• Genes in G falling under pre-defined class: A
• Number of genes selected: k
• Number of selected genes k in class A: n
• The number of genes expected to overlap by chance: (k × A)/G
• One-sided Fisher’s exact test determines whether n is significantly greater than (k × A)/G
• The probability P of the term occurring n or more times within a set of k genes randomly selected from the population:
$$P = \sum_{i=n}^{\min(k,A)} \frac{\binom{A}{i}\binom{G-A}{k-i}}{\binom{G}{k}}$$
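As a minimal sketch, the upper-tail probability can be computed with R's built-in `phyper`, using the slide's symbols G, A, k, and n. The wrapper names `enrichment_p` and `expected_overlap` are hypothetical helpers introduced only for this illustration.

```r
# P(X >= n): probability of n or more class-A genes among k genes
# drawn at random from a population of G genes, A of which are in the class.
enrichment_p <- function(G, A, k, n) {
  # phyper(q, m, n, k) is P(X <= q); lower.tail = FALSE gives P(X > q),
  # so q = n - 1 yields P(X >= n).
  phyper(n - 1, m = A, n = G - A, k = k, lower.tail = FALSE)
}

# Number of genes expected to overlap by chance: (k x A) / G
expected_overlap <- function(G, A, k) k * A / G

# Ball example: 100 balls, 10 red, 20 drawn, 5 of the drawn balls are red
expected_overlap(G = 100, A = 10, k = 20)
enrichment_p(G = 100, A = 10, k = 20, n = 5)
```

Using `phyper` with `lower.tail = FALSE` avoids summing individual `dhyper` terms and subtracting from 1, which can lose precision when the p-value is very small.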
What is the total gene population (G)?
• Can represent the number of genes profiled
on the array chip.
• What if two different array platforms were
used (a different population of genes is
typically represented on each)?
– Use the common set of genes represented on
both array chips as the total population (do not
consider genes not represented on both arrays)
– Use ONE of the two array platforms to define the
gene population (do not consider genes on the
other array platform that are not represented on
the first platform)
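The first option (using the genes common to both platforms) might look like this in R. The gene lists are small made-up examples, not real array content.

```r
# Hypothetical gene lists for two array platforms
platform_a <- c("EGFR", "ERBB2", "MYC", "TP53", "KRAS", "PTEN")
platform_b <- c("EGFR", "MYC", "TP53", "KRAS", "CCND1")

# Total gene population G: genes represented on both platforms
universe <- intersect(platform_a, platform_b)
G <- length(universe)

# A signature defined on platform A is restricted to the shared universe,
# so genes absent from platform B are not considered in the overlap test.
signature_a      <- c("ERBB2", "MYC", "KRAS")
signature_a_used <- intersect(signature_a, universe)

G
signature_a_used
```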
A gene signature of mutation of
EGFR in NSCLC cell lines
• Compared lung cancer cell lines with or without an activating mutation in EGFR.
• Wanted to compare this gene signature with another gene signature of EGFR.
Lung cancer cell lines
Choi, Creighton, et al., PLoS ONE 2(11): e1226.
Oncogenic signatures of ERBB2, EGFR,
MEK, RAF, and MAPK in breast cancer cells
• Does the published MCF-7+EGFR signature overlap with the NSCLC EGFR signature?
Creighton et al. Cancer
Res. 2006 66(7):3903-11.
Compare NSCLC EGFR mutant
signature with a signature of
EGFR-transfected MCF-7 cells
• EGFR wt NSCLC genes: 119
• MCF7 EGFR genes: 1152
• Genes shared between MCF7/NSCLC array platforms: 11079
• Genes shared between MCF7/NSCLC gene signatures: 44
• Significance of overlap (one-sided Fisher’s exact test): p<1E-10
Choi, Creighton, et al., PLoS ONE 2(11): e1226.
A gene signature of mutation of EGFR
in NSCLC cell lines is enriched with
EGFR-dependent genes.
Choi, Creighton, et al., PLoS ONE 2(11): e1226.
Experimental models versus
clinical tumors
• Molecular data from experimental models
represent dynamic information, but
clinical relevance is not always clear (e.g.
could represent experimental artifacts).
• Data from clinical tumor specimens
represent more static information, where
the associations observed may be
pathologically relevant.
Experimental models versus
clinical tumors
• From clinical data, cannot distinguish
cause-and-effect associations from
correlation alone.
• In cancer studies, important to combine
the experimental with the clinical.
– Some researchers may doubt the validity of
experimental results unless they can be shown
to apply to human tissues
Rank-based enrichment analysis
Figure labels: rank-ordered genes from dataset A; locations of genes from set B
• Rank-based approaches use all of the genes
from one of the datasets to determine
enrichment (they do not make a “cut”).
GSEA (rank-based) enrichment
analysis
All the genes in
the dataset are
used here
Subramanian, Aravind et al. (2005) Proc. Natl. Acad. Sci. USA 102, 15545-15550
• Start from the top of the ranked list.
• Add points to the “random walk” for each gene you find in S.
• Remove points from the “random walk” for each gene not in S.
GSEA Kolmogorov-Smirnov statistic
Consider the genes R_1, ..., R_N that are ordered on the basis of the difference
metric between the two classes and a gene set S containing G members.
We define
$$X_i = -\sqrt{\frac{G}{N-G}}$$ if R_i is not a member of S, or
$$X_i = \sqrt{\frac{N-G}{G}}$$ if R_i is a member of S.
We then compute a running sum across all N genes. The ES is defined as
$$ES = \max_{1 \le j \le N} \sum_{i=1}^{j} X_i$$
or the maximum observed positive deviation of the running sum.
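The running-sum definition above can be sketched in R. The function name `gsea_es` and the toy gene names are hypothetical; the step sizes assume the square-root weighting in which a hit adds sqrt((N-G)/G) and a miss subtracts sqrt(G/(N-G)), so the walk returns to zero at the end of the list.

```r
# Running-sum enrichment score for a ranked gene list and a gene set S.
gsea_es <- function(ranked_genes, gene_set) {
  N      <- length(ranked_genes)
  in_set <- ranked_genes %in% gene_set
  G      <- sum(in_set)                    # set members present in the ranked list
  stopifnot(G > 0, G < N)
  # Step up for genes in S, step down for genes not in S
  steps   <- ifelse(in_set, sqrt((N - G) / G), -sqrt(G / (N - G)))
  running <- cumsum(steps)
  list(es = max(running), running = running)
}

# Toy example: a 3-gene set sitting at the very top of 10 ranked genes
ranked <- paste0("g", 1:10)
res    <- gsea_es(ranked, c("g1", "g2", "g3"))
res$es        # peak of the random walk
res$running   # full running sum, ends at zero
```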
GSEA Kolmogorov-Smirnov statistic
• The ES score (the “peak” of the Random
walk) is just a number.
• Need to evaluate the significance of the
number by some type of permutation
testing:
– Permute the sample labels many times, OR
– Permute the gene sets (i.e. randomly generate
gene sets).
• In either case, compare distribution of
scores from random tests with the actual
score.
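A minimal sketch of the second option, permuting the gene sets, is shown below. All data are simulated, the helper name `enrichment_score` is hypothetical, and 1000 permutations is an arbitrary choice for illustration.

```r
set.seed(1)

# Enrichment score (peak of the running sum) from a logical vector that
# marks, in rank order, which genes belong to the gene set.
enrichment_score <- function(in_set) {
  N <- length(in_set)
  G <- sum(in_set)
  max(cumsum(ifelse(in_set, sqrt((N - G) / G), -sqrt(G / (N - G)))))
}

# Hypothetical ranked list of 200 genes with a 12-gene set
# concentrated near the top of the ranking
N <- 200
in_set <- rep(FALSE, N)
in_set[c(1:10, 50, 120)] <- TRUE
observed <- enrichment_score(in_set)

# Null distribution: random gene sets of the same size
null_scores <- replicate(1000, enrichment_score(sample(in_set)))

# Empirical p-value: fraction of random scores at least as large as observed
p_value <- (sum(null_scores >= observed) + 1) / (length(null_scores) + 1)
p_value
```

Permuting sample labels instead would require recomputing the gene ranking for every permutation, which is why it needs the full expression matrix rather than just the ranked list.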
GSEA (rank-based) enrichment
analysis
Subramanian, Aravind et al. (2005) Proc. Natl. Acad. Sci. USA 102, 15545-15550
Examples of GSEA running enrichment scores
GSEA (rank-based) enrichment
analysis
Subramanian, Aravind et al. (2005) Proc. Natl. Acad. Sci. USA 102, 15545-15550
Sets with genes not located at the top of the
ranked gene population may still yield significant
enrichment scores.
A mechanism of cyclin D1 action encoded in the
patterns of gene expression in human cancer
Lamb, et al. Cell 114:323-34, 2003
The Connectivity Map of gene
signatures induced by 164 different
small molecule inhibitors
Lamb et al., Science. 2006 313(5795):1929-35
The Connectivity Map
(Scoring derived from GSEA statistic)
Venn diagram vs Rank-based
methods
• Venn diagram results more easily
interpretable.
• For rank-based methods, genes that are
not at all significant individually may
contribute to enrichment.
– What gene do you go after for validation?
• With Venn diagram, have to make a cut.
– May not include enough genes in the test.
Venn diagram vs Rank-based methods,
what is a significant p-value?
• If using the Venn diagram method in
expression studies, p-value should be very
low if working with sizable gene sets (e.g.
<1E-6).
• If using rank-based method, can consider a
nominally significant p-value (e.g. p<0.05)
to be good if permuting the sample labels is
involved.
• Can always try both ways in order to be
certain of an enrichment association.
Correlation-based approach
• Take the correlation between two sets of
profiling results from different datasets.
• May use all of the genes profiled or a
specified subset (e.g. genes in a gene
signature).
• The correlation metric may be any one of
a number of valid metrics (e.g. Pearson’s
or Spearman’s rank).
Correlation-based approach
• Each gene used in the correlation may be
“weighted” in a number of ways
– t-statistic, comparing two groups
– Mean-centered expression values
– “+1” or “-1” for “up” or “down,” respectively
• Again, direction of the genes is important
– Positive correlation indicates similar overall patterns
between the two datasets.
• Example: IGF “activation score” from Creighton
et al., JCO 2008.
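One simple way to turn these ideas into a per-tumor score is sketched below, using “+1”/“-1” weights and mean-centered expression values. The gene names and expression values are invented, and this is a generic illustration rather than the exact scoring procedure of any cited study.

```r
# Signature weights: +1 for "up" genes, -1 for "down" genes
signature <- c(geneA = 1, geneB = 1, geneC = -1, geneD = -1)

# Hypothetical log-scale expression matrix: genes in rows, tumors in columns
expr <- cbind(tumor1 = c(9, 8, 4, 3),
              tumor2 = c(5, 6, 8, 9),
              tumor3 = c(7, 8, 6, 5))
rownames(expr) <- names(signature)

# Mean-center each gene across the tumors
centered <- expr - rowMeans(expr)

# Score each tumor: Pearson correlation between the signature weights
# and the tumor's centered expression values
scores <- apply(centered, 2, function(x) cor(signature, x))
scores
```

A positive score means the tumor's pattern runs in the same direction as the signature; a negative score means the pattern is reversed.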
Example analyses comparing
gene transcription signatures
from different studies
A gene signature of Insulin-like
growth factor I (IGF-I)
• Substantial evidence implicates insulin-like
growth factor I (IGF-I) signaling in the
development and progression of breast cancer.
• Gene expression profiling of IGF-I-stimulated
MCF-7 cells was performed.
• An IGF-I gene signature was examined in
human breast tumors, as well as in experimental
models for specific oncogenic signaling
pathways.
Creighton CJ, et al., Lee AV. JCO. 26:4078-85.
Genes altered by IGF-I at 3hr or
24hr or both
A gene signature of Insulin-like
growth factor I (IGF-I)
Oncogenic pathway signatures in human
cancers as a guide to targeted therapies
Bild et al. Nature. 2006
439(7074):353-7.
• Examine previously published dataset for oncogenic signatures overlapping with IGF signature
The IGF signature is enriched for transcriptional
targets of the Ras pathway
The Connectivity Map of gene
signatures induced by 164 different
small molecule inhibitors
Lamb et al., Science. 2006 313(5795):1929-35
The IGF signature is enriched for transcriptional
targets of the PI3K/Akt/mTOR pathway
IGF signature is present in
human breast cancers
Widespread deregulation of gene
expression in cancer
• Gene expression profiling identifies different subtypes of breast cancer.
Sorlie et al. PNAS. 2003 100(14):8418-23
IGF signature is present in luminal B
and basal breast tumors
Data from Sorlie et al. PNAS. 2003 100(14):8418-23
IGF signature is associated with poor
prognosis in ER+ breast tumors
Relating gene expression profile results
from different datasets to each other by
unsupervised clustering methods:
USUALLY NOT A GOOD IDEA
• Unsupervised clustering is a technique for data
analysis that partitions a data set into subsets
whose elements share common traits
• Many groups will try to relate a gene signature to
another dataset by clustering the samples in the
dataset using the genes in the signature
• The main problem with this: Unsupervised
clustering does not take the direction of the
genes in the signature into account.
• Identification of a Common Serum Response (CSR) gene signature in fibroblasts
• Starve fibroblasts, then give them serum and see what genes are upregulated or downregulated.
Chang et al., PLoS Biol. 2004 Feb;2(2):E7
Survey of fibroblast CSR gene
expression in human cancers
• Using the genes in the CSR signature, cluster human tumors.
• Tumors form two major groups.
Chang et al., PLoS Biol. 2004 Feb;2(2):E7
Prognostic value of fibroblast CSR
in epithelial tumors
• Tumors in the “activated” group had worse outcome.
Chang et al., PLoS Biol. 2004
What issues are there with this
type of analysis approach?
• The clustering method does not tell us which direction the CSR genes are moving.
• Are genes up in the CSR signature also up in the “Activated” tumor set?
What issues are there with this
type of analysis approach?
• These bars indicate the direction of the CSR genes in these clusters (red=up)
• CSR pattern does appear here to be manifested in half the tumors.
Excel functions/features you will
need for the computational
exercise
TTEST Worksheet function
TTEST(array1,array2,tails,type)
• Array1 is the first data set.
• Array2 is the second data set.
• Tails specifies the number of distribution tails
(Use “2” for the computational exercise.)
• Type is the kind of t-Test to perform (Use “2”).
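For readers who prefer R, a rough counterpart to TTEST(array1, array2, 2, 2) is a two-sided, two-sample t-test that assumes equal variances. The values below are invented log-scale expression values for a single gene.

```r
# Hypothetical expression values for one gene in two groups of samples
group1 <- c(7.1, 6.8, 7.4, 7.0)
group2 <- c(8.2, 8.5, 7.9, 8.4)

# Tails = 2 corresponds to a two-sided test;
# Type = 2 corresponds to a two-sample test with equal variances.
p <- t.test(group1, group2, alternative = "two.sided", var.equal = TRUE)$p.value
p
```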
AVERAGE Worksheet function
AVERAGE(number1, number2, ...)
• Number1, number2, ... are 1 to 30 numeric arguments for which you want the average.
• The arguments must either be numbers or be names, arrays, or references that contain numbers.
Data->Filter->AutoFilter
1. Unfiltered range
2. Filtered range
• When you use the AutoFilter
command, AutoFilter arrows
appear to the right of the column
labels in the filtered range.
• Microsoft Excel indicates the
filtered items with blue.
• You use custom AutoFilter to
display rows that meet complex
criteria; for example, you might
display rows that contain values
within a specific range (e.g.
p<0.01)
MATCH Worksheet function
MATCH(lookup_value,lookup_array,match_type)
• Lookup_value is the value you use to find the
value you want in a table.
– Lookup_value is the value you want to match in
lookup_array. For example, when you look up someone's
number in a telephone book, you are using the person's
name as the lookup value, but the telephone number is
the value you want.
– Lookup_value can be a value (number, text, or logical
value) or a cell reference to a number, text, or logical
value.
• Lookup_array is a contiguous range of cells
containing possible lookup values. Lookup_array
must be an array or an array reference.
• Match_type should be set to 0 for our purposes.
COUNT Worksheet function
• If an argument is an array or reference, only
numbers in that array or reference are counted.
Empty cells, logical values, text, or error values in
the array or reference are ignored.
(Don’t forget the $)
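A rough R analogue of the MATCH and COUNT pattern is sketched here, assuming the goal is to count how many genes in one list also appear in another. The gene lists are made up for illustration.

```r
# Hypothetical gene lists
signature_genes <- c("EGFR", "MYC", "CCND1", "TP53")
top_genes       <- c("KRAS", "MYC", "TP53", "PTEN")

# Like MATCH with match_type 0: position of each signature gene in top_genes,
# or NA when there is no exact match (Excel would show #N/A)
positions <- match(signature_genes, top_genes)

# Like COUNT over the MATCH results: only the numeric (non-missing) entries count
n_overlap <- sum(!is.na(positions))

positions
n_overlap
```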
R functions you will need for the
computational exercise
dhyper function in R
• Example:
– 100 balls
– 10 of the balls are red
– I grab 20 balls
– Five of my 20 balls are red
• Was the number of red balls I selected a
significant number ?
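> m <- 10    # number of red balls
> n <- 90    # number of other balls (total population - m)
> k <- 20    # number of balls selected
> x <- 0:k   # vector of possible numbers of red balls drawn (0 to k)
> 1 - sum(dhyper(x, m, n, k)[1:5])   # elements 1:5 are P(0) to P(4), so this is P(5 or more red)
[1] 0.02546455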
dhyper function in R
• EGFR mutant signature example:
– 11079 Genes shared between MCF7/NSCLC
array platforms
– 119 EGFR wt NSCLC genes
– 1162 MCF7 EGFR genes
– 44 genes shared between MCF7/NSCLC gene
signatures
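> m <- 119           # number of EGFR wt NSCLC genes
> n <- 11079 - 119   # number of other genes
> k <- 1162          # number of MCF7 EGFR genes
> x <- 0:k           # vector of possible overlap sizes (0 to k)
> 1 - sum(dhyper(x, m, n, k)[1:44])   # elements 1:44 are P(0) to P(43), so this is P(44 or more shared)
[1] 1.265654e-14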
General concepts of gene
expression analysis
• Low level analysis
– Processing image files.
– Normalization
– QC
• High level analysis
– Clustering
– Selecting differentially expressed genes.
– Enrichment analysis
Processing image files
• From CEL, GPR, or TXT files with image
information, want to generate gene
expression values
• For two color arrays (e.g. Stanford cDNA
arrays), can use Bioconductor
• For one channel array (e.g. Affymetrix),
can use dChip or Bioconductor
Normalization
• Purpose: To adjust the overall chip
brightness of the arrays to a similar level
• Methods:
– Two channel arrays
• ‘Loess’ normalization is good
– One channel arrays
• Total intensity normalization
• Quantile normalization
• Invariant set normalization
Before
Normalization
After
Normalization
www.dchip.org
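Of the one-channel methods listed, quantile normalization is simple enough to sketch in a few lines of R. This is a simplified illustration on a made-up matrix, not the dChip or Bioconductor implementation; the function name `quantile_normalize` is hypothetical and missing values are assumed to be absent.

```r
# Quantile normalization: force every array (column) to share the same
# distribution of values. Genes are in rows, arrays are in columns.
quantile_normalize <- function(x) {
  # Rank of each value within its own array (ties broken by order of appearance)
  ranks  <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the sorted values across arrays
  target <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value at its rank
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

# Hypothetical intensities for 4 genes on 3 arrays of differing brightness
x <- cbind(a = c(5, 2, 3, 4),
           b = c(4, 1, 4, 2),
           c = c(3, 4, 6, 8))
qn <- quantile_normalize(x)
qn
```

After normalization every column contains the same set of values, only in a different gene order.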
High level analysis
• Selecting differentially expressed genes
– Account for multiple testing
• Clustering
– Hierarchical clustering
– Principal Components analysis
– K-means clustering
• Enrichment analysis or “Meta-analysis”
Selecting differentially expressed
genes
• Student’s t-test or ANOVA typically used
– Works best on log-transformed data
• Other criteria
– “fold change”
– Higher average signal intensity might indicate greater
abundance
• What p-value cutoff do you choose?
– No “right” answer
– Need to balance between false positives and false
negatives
• More stringent p-value, fewer false positives, more false
negatives
• Less stringent p-value, fewer false negatives, more false
positives
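A minimal sketch of gene selection by t-test plus fold change follows. The data are simulated and assumed to be log2-transformed already; the cutoffs (p < 0.01, at least 2-fold) are arbitrary examples, in keeping with the point that there is no "right" answer.

```r
set.seed(42)

# Simulated log2 expression matrix: 100 genes, 3 normal and 3 tumor samples
n_genes <- 100
expr <- matrix(rnorm(n_genes * 6, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:6)))
group <- factor(c("normal", "normal", "normal", "tumor", "tumor", "tumor"))

# Make the first 5 genes truly higher in tumors (illustrative spike-in)
expr[1:5, group == "tumor"] <- expr[1:5, group == "tumor"] + 3

# Student's t-test for each gene (row)
p <- apply(expr, 1, function(v) {
  t.test(v[group == "tumor"], v[group == "normal"], var.equal = TRUE)$p.value
})

# Fold change on the log2 scale: difference of group means
log2fc <- rowMeans(expr[, group == "tumor"]) - rowMeans(expr[, group == "normal"])

# Combine a p-value cutoff with a fold-change cutoff
selected <- names(p)[p < 0.01 & abs(log2fc) > 1]
selected
```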
Multiple testing
• When evaluating thousands of genes,
some will show a nominally significant P-value by chance alone
• Somewhat like buying lots and lots of
lottery tickets: your chances of winning
greatly improve.
• Want to estimate false discovery rate
(FDR)
Multiple testing
• Estimate FDR by method from Storey et al.
(PNAS 2003 100:9440-5).
$$\text{FDR} = \frac{[\text{Number of genes on the array}] \times [\text{nominal P-value}]}{[\text{Number of genes significant with that P-value}]}$$
• Use permutation testing (e.g. SAM analysis,
Tusher et al., PNAS 2001 98:5116-21)
– Randomly assign sample labels and do the test
– Do it many times to get a distribution of false
positives
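The FDR ratio above is easy to compute directly. The sketch below uses invented p-values and a hypothetical helper name, `estimate_fdr`; it implements only the simple ratio shown on the slide, not a full published procedure.

```r
# Estimated FDR at a given nominal p-value cutoff:
# (number of genes on the array x cutoff) / (number of genes significant at the cutoff)
estimate_fdr <- function(p_values, cutoff) {
  n_sig <- sum(p_values <= cutoff)
  if (n_sig == 0) return(NA_real_)   # no significant genes, FDR undefined
  min(1, length(p_values) * cutoff / n_sig)
}

# Hypothetical array of 1000 genes, 50 of which have very small p-values
p_values <- c(rep(0.0005, 50), seq(0.02, 1, length.out = 950))

# At p <= 0.01, about 1000 x 0.01 = 10 genes are expected by chance,
# and 50 genes are called significant
fdr <- estimate_fdr(p_values, cutoff = 0.01)
fdr
```

R's built-in `p.adjust(p_values, method = "BH")` offers a related, more formal adjustment when a per-gene adjusted value is wanted.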
Cluster analysis
• Cluster analysis relates to grouping or
segmenting a collection of objects (e.g. genes or
samples) into subsets or "clusters", such that
those within each cluster are more closely
related to one another than objects assigned to
different clusters.
• Central to cluster analysis is the notion of degree
of similarity (or dissimilarity) between the
individual objects being clustered.
Cluster analysis
• Major methods of clustering include
hierarchical clustering, k-means clustering,
and principal components analysis (PCA)
• Hierarchical clustering most common for
expression profile data analysis
• “Cluster” and “JavaTreeview” public
software programs from Eisen et al.
(http://rana.lbl.gov/) are handy for cluster
analysis and/or generating heat maps
Hierarchical clustering – 3 methods
for measuring distance between
clusters
• Single linkage, using the members of each
cluster that are closest to each other
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
Hierarchical clustering – 3 methods
for measuring distance between
clusters
• Complete linkage, using the members of
each cluster that are furthest from each other
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
Hierarchical clustering – 3 methods
for measuring distance between
clusters
• Average linkage, using the average of
each cluster, most commonly used.
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
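The three linkage methods can be compared in R with the built-in `hclust` function. The expression values below are invented, and using one minus the Pearson correlation as the distance between samples is an assumption made for this sketch.

```r
# Hypothetical expression matrix: 4 genes (rows) by 6 samples (columns)
expr <- cbind(s1 = c(1.0, 2.0, 9.0, 8.0),
              s2 = c(1.2, 2.1, 9.5, 7.9),
              s3 = c(0.8, 1.9, 8.7, 8.2),
              s4 = c(9.0, 8.0, 1.0, 2.0),
              s5 = c(9.3, 8.1, 1.2, 2.2),
              s6 = c(8.8, 7.7, 0.9, 1.8))

# Dissimilarity between samples: 1 - Pearson correlation
d <- as.dist(1 - cor(expr))

# Hierarchical clustering with single, complete, and average linkage
methods <- c(single = "single", complete = "complete", average = "average")
fits <- lapply(methods, function(m) hclust(d, method = m))

# Cut the average-linkage tree into two clusters of samples
groups <- cutree(fits$average, k = 2)
groups
```

Calling `plot(fits$average)` draws the dendrogram; with clearly separated groups like these, all three linkage methods should give the same two clusters.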
Widespread deregulation of gene
expression in cancer
• Gene expression profiling identifies different subtypes of breast cancer.
Sorlie et al. PNAS. 2003 100(14):8418-23
Final words on gene expression
profile analysis
• “All good roads lead to Rome.”
• i.e., there are many ways to go about
exploratory analysis, which can lead to the
same overall conclusions
• What’s important
– Be clear and concise about what you did (so
others can understand it and repeat it)
– Don’t try to fool anybody (including yourself)