Introduction to high-throughput data analysis Guanghua (Andy) Xiao July 24, 2012

University of Texas Southwestern Medical Center
Overview
• Introduction to high-throughput data
• Gene expression microarray data
• Data Preprocessing
• Data analysis
  1. Gene clustering and gene function prediction
  2. Identify differentially expressed genes
  3. Pathway/gene set enrichment analysis
  4. Constructing gene network
  5. Classification/prediction
• Some real data analysis examples
Traditional Biology
Systems Biology
Northern blot vs microarray
Data Matrix
            CL20041  CL20041  CL20041  CL200411  CL20041  CL20041  CL20041  CL20041
            10909AA  11002AA  11003AA  10100AA   11010AA  11013AA  11017AA  11018AA  ……
1007_s_at   10.4     10.2     10.22    10.7      9.63     12.05    9.42     10.75
1053_at     6.37     7.53     6.11     6.61      6.45     6.65     6.87     7
117_at      6.44     7.04     6.77     7.61      7.07     7.04     6.95     6.56
121_at      8.99     8.92     9.03     9.13      8.87     8.85     8.74     8.56
1255_g_at   4.36     4.73     4.54     4.73      4.48     4.55     4.78     4.47
1294_at     7.79     8.1      8.19     8.17      8.24     8.14     7.92     7.89
1316_at     6.16     6.41     6.43     6.47      6.15     6.92     6.07     6.13
1320_at     5.09     5.05     4.86     5.13      4.97     4.96     5.02     5.01
1405_i_at   8.38     8.82     8.47     7.6       8.24     7.36     8.37     7.11
1431_at     4.37     4.34     4.39     4.44      4.26     4.37     4.38     4.25
1438_at     7.87     8.3      7.73     7.74      6.73     7.47     7.4      7.34
1487_at     7.62     8.1      8.05     8.26      7.25     7.7      7.9      7.64
1494_f_at   7        7.35     7.25     7.29      6.79     7.11     7.04     6.92
……          ……       ……
Application
• Genetics
  Genome-wide association study (GWAS), copy number variation (CNV)
  Technique: genome-wide single nucleotide polymorphism (SNP) array
• Epigenetics
  Definition: mechanisms that cause gene expression changes without changes in the underlying DNA sequence
  Examples: DNA methylation, histone methylation/acetylation, and transcription factor binding
  Techniques: chromatin immunoprecipitation on chip (ChIP-chip), chromatin immunoprecipitation sequencing (ChIP-Seq)
• Gene/exon expression
  Technique: gene expression array, exon expression array, RNA-Seq
• Protein expression
• Compound screening
SOME EXAMPLES
Discover disease subtypes
Golub et al, 1999, Science
Yeoh et al, 2002, Cancer Cell
Development of Diagnostic Tests for Cancer
From Ramaswamy, N Engl J Med, 2004
Identify tumor driver genes
Weir et al, Nature, 2007
Akavia et al, Cell, 2010
GENE EXPRESSION MICROARRAY
Microarray Platforms
The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements
MAQC Consortium
Nature Biotechnology 24, 1151 - 1161 (2006)
Published online: 8 September 2006 | doi:10.1038/nbt1239
Platforms
Manufacturer        Code  Protocol              Platform                                      # of Probesets
Applied Biosystems  ABI   One-color microarray  Human Genome Survey Microarray v2.0           32,878
Affymetrix          AFX   One-color microarray  HG-U133 Plus 2.0 GeneChip                     54,675
Agilent             AGL   Two-color microarray  Whole Human Genome Oligo Microarray, G4112A   43,931
Agilent             AG1   One-color microarray  Whole Human Genome Oligo Microarray, G4112A   43,931
Eppendorf           EPP   One-color microarray  DualChip Microarray                           294
GE Healthcare       GEH   One-color microarray  CodeLink Human Whole Genome, 300026           54,359
Illumina            ILM   One-color microarray  Human-6 BeadChip, 48K v1.0                    47,293
NCI_Operon          NCI   Two-color microarray  Operon Human Oligo Set v3                     37,632
Applied Biosystems  TAQ   TaqMan assays         >200,000 assays available                     1,004
Panomics            QGN   QuantiGene assays     2,600 assays available                        245
Gene Express        GEX   StaRT-PCR assays      1,000 assays available                        207
Spotted Array
Affymetrix Array
• Probes = 25 nt sequences
• Probe sets = 11 to 20 probes corresponding to a particular gene or EST
• Sequence data obtained from dbEST, GenBank, and RefSeq.
• Draft assembly of Human Genome (NCBI Build 31) used to assess sequence orientation and quality.
• Probes selected from the 600 bases most proximal to the 3' end of each transcript.
Illumina BeadArray
• Advantages:
  – High quality
  – Less expensive
  – Needs much less RNA sample
• Features:
  – Multiple replicates
  – Negative control: nonspecific beads to control the background noise level
Microarray Data Preprocessing
Data Preprocessing
• The purpose of preprocessing microarray data is to minimize systematic (technical) variation while retaining the full biological variation. This is a critical step for obtaining valid results.
• Steps:
  Image acquisition and feature extraction: defining the array features, which correspond to the probe spots on the microarray, so that the hybridization intensity of each spot can be determined.
  Background correction: removing the signal intensity that comes from non-specific hybridization and from fluorescence of the solid support.
  Normalization: correcting for systematic differences between samples on the same chip, or between chips, which do not represent true biological variation between samples.
  Summarization: combining probe-level or bead-level intensities into a single gene-level intensity.
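As one illustration of the normalization step, the sketch below applies quantile normalization, which forces every array to share the same intensity distribution. The slides do not say which normalization method was used, so this choice is an assumption (it is the normalization used inside RMA); the function name and the toy intensities are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length samples (one list per array).

    Each value is replaced by the mean of the values that hold the same
    rank in every sample, so all samples end up with one shared distribution.
    Ties are broken by position, which is a simplification.
    """
    n = len(columns[0])
    # Sort each sample, then average across samples rank by rank.
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns) for r in range(n)]

    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices, smallest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy log2 intensities: 4 probes on 3 arrays (hypothetical values).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After normalization the three arrays contain the same set of values, only in a different probe order, so remaining differences between arrays reflect rank changes rather than overall intensity shifts.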
The top 10 genes based on an analysis of the Beer et al. data using different processing methods.

RMA                 MAS5                Beer et al.
Symbol    P         Symbol    P         Symbol    P
CD8B      0.0697    RAFTLIN   0.0245    RAFTLIN   0.0187
SLC2A1    0.127     TMSB4X    0.0465    NP        0.0993
CCR2      0.2111    SLC2A1    0.0559    KLHDC3    0.2968
PLD3      0.2224    IHPK1     0.3312    TMSB4X    0.3808
RAFTLIN   0.2433    MLL       0.3414    CXCL3     0.4084
HNRPL     0.2787    NP        0.3492    SELP      0.4441
BCL2      0.3106    PRKACB    0.4494    STX1A     0.5026
PFKP      0.3223    <NA>      0.4787    SEC31L1   0.5068
STX1A     0.361     E2F4      0.5528    PRKACB    0.5355
INPP5D    0.369     P2RX5     0.5846    PBXIP1    0.5571

Owzar K, et al, CCR, 2008
Another example
The overlaps among top 50 genes

                        NonNormalization     Normalization
                        Mean      T          Mean      T
NonNormalization Mean   50        2          18        3
NonNormalization T                50         2         5
Normalization    Mean                        50        2
Normalization    T                                     50

T: Student t-test
Xie et al, 2003
Microarray Data Analysis
1. Predicting gene functions
Predicting gene functions (Guilt by association)
• Unsupervised learning (cluster)
  Hierarchical clustering
  K-means clustering
  Self Organizing Map (SOM)
• Supervised learning (classifier/predictor)
  K-nearest Neighbor (KNN)
  Linear Discriminant Analysis (LDA)
  Support Vector Machine (SVM)
[Figure: microarray expression data and the resulting co-expression network. Cell cycle cluster: CDC3, CLB4, CDC16, UNK1. Protein degradation cluster: RPT1, RPN3, RPT6, UNK2.]
Eisen et al (PNAS 1998)
Fraser AG, Marcotte EM - A probabilistic view of gene function - Nat Genet. 2004 Jun;36(6):559-64
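A minimal sketch of the guilt-by-association idea: an uncharacterized gene is given the function of the annotated gene whose expression profile it correlates with best. The gene names are borrowed from the figure labels on this slide, but the expression profiles are invented for illustration, and the single-best-match rule is an assumption rather than the method of the cited papers.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length expression profiles."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical profiles over five conditions: (known function, expression values).
annotated = {
    "CDC3": ("cell cycle", [1.0, 2.0, 3.0, 2.0, 1.0]),
    "CLB4": ("cell cycle", [1.1, 2.1, 2.9, 2.2, 0.9]),
    "RPT1": ("protein degradation", [3.0, 2.0, 1.0, 2.0, 3.0]),
    "RPN3": ("protein degradation", [2.8, 2.1, 1.2, 1.9, 3.1]),
}
unknown = {
    "UNK1": [0.9, 1.9, 3.1, 2.1, 1.0],
    "UNK2": [3.1, 1.9, 0.9, 2.1, 2.9],
}

predicted = {}
for name, profile in unknown.items():
    # Pick the annotated gene with the highest correlation to this profile.
    best = max(annotated, key=lambda g: pearson(profile, annotated[g][1]))
    predicted[name] = annotated[best][0]
```

In practice the call would be based on a whole cluster or on several neighbors (as in KNN) rather than on one best-matching gene.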
Microarray Data Analysis
2. Identify differentially expressed genes
Identify DE genes
• Goal: find which genes are expressed differently under different conditions (e.g., normal vs. tumor tissues)
• Quantification of cDNA microarray data
  • Suppose that $X_{ij}$ is the normalized log-ratio of the two channel intensities for gene $i$ on array $j$; $j = 1, \ldots, n$ and $i = 1, \ldots, G$
  • Hypothesis: $\mu_i = E(X_{ij}) = 0$ for each $i$
• Ranking genes by test statistics
• Deciding cut-off value
Identify DE genes
• Ranking criteria
M-statistic: $\bar{X}_i$, the average of $X_{ij}$ over the $n$ replicate arrays for gene $i$ (the mean log fold change)
Student t-statistic: $t_i = \bar{X}_i / v_i$, where $v_i$ is the standard error of $\bar{X}_i$
SAM statistic (Tusher et al, 2001): $S_i = \bar{X}_i / (v_i + a_0)$, where $a_0$ is a small positive constant
B-statistic (Lonnstedt et al, 2001): empirical Bayes statistic
James-Stein estimator for standard deviation (Cui et al, 2005)
Some non-parametric approaches
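The first three ranking criteria can be sketched as follows for a single gene. This is an illustration only: the log-ratios are made up, and a0 = 0.1 is an arbitrary value chosen for the example (SAM itself chooses the constant from the data).

```python
import math
from statistics import mean, stdev


def gene_statistics(log_ratios, a0=0.0):
    """Return (M, t, S) for one gene from its log-ratios across arrays.

    M is the mean log-ratio, t the one-sample Student t-statistic, and
    S the SAM-style statistic with the constant a0 added to the denominator.
    """
    n = len(log_ratios)
    m = mean(log_ratios)
    v = stdev(log_ratios) / math.sqrt(n)  # standard error of the mean
    return m, m / v, m / (v + a0)


# Two hypothetical genes: a large, consistent change and a tiny, consistent change.
gene_a = [1.0, 1.2, 0.8, 1.0]
gene_b = [0.05, 0.06, 0.04, 0.05]

m_a, t_a, s_a = gene_statistics(gene_a, a0=0.1)
m_b, t_b, s_b = gene_statistics(gene_b, a0=0.1)
```

The two genes have the same t-statistic because gene_b is just gene_a scaled by 0.05, yet the SAM-style statistic ranks gene_a far higher. Adding a0 keeps genes with very small variance and negligible fold change from reaching the top of the list.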
Multiple Testing and FDR
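• Controlling the family-wise error rate (FWER)
  Bonferroni correction
  Hochberg FWER procedure
  Other corrections
  Controlling FWER for microarray analysis is often too conservative, because thousands of genes are tested at once
• False discovery rate (Benjamini and Hochberg 1995, Xie 2005):
  $\mathrm{FDR}(d) = \mathrm{FP}(d)/\mathrm{TP}(d)$, where $\mathrm{FP}(d)$ is the number of false positives and $\mathrm{TP}(d)$ is the total number of genes called positive at cut-off $d$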
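To make the contrast concrete, here is a small sketch comparing the Bonferroni correction with the Benjamini-Hochberg step-up procedure on a list of p-values. The p-values are invented for illustration, and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices rejected when each p-value is compared with alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cut-off.
    return sorted(order[:cutoff_rank])


# Ten hypothetical gene-level p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

fwer_hits = bonferroni(p_values)         # controls the family-wise error rate
fdr_hits = benjamini_hochberg(p_values)  # controls the false discovery rate
```

On these values Bonferroni keeps only the first gene while Benjamini-Hochberg keeps the first two, which shows on a small scale why FDR control is usually preferred when many genes are tested.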
SAM plot
Microarray Data Analysis
3. Pathway/gene set enrichment analysis
Pathway Analysis
Gene Set Enrichment Analysis
Microarray Data Analysis
4. Constructing gene network
Construct gene network
Microarray Data Analysis
5. Prediction and classification
Leukemia Diagnosis
[Figure: expression data matrix for m samples with class labels {yi}, i=1:m, taking values -1 and +1]
Golub et al, Science Vol 286:15 Oct. 1999
MDACC tumor sample clustering
[Figure: sample clustering showing Cluster 1, Cluster 2, and Cluster 3]
Kaplan-Meier plots for two clusters
Predicting Drug Response
Background
• Lung cancer is the leading cause of death from cancer among both men and women in the United States
• Median survival time for non-small cell lung cancer: 8 months
• Cancer patients respond differently to chemotherapy because of the complexity and uniqueness of each tumor's genetic profile
• Personalized medicine: match the right therapeutic regimen with the right individual
MTS Drug Sensitivity Assay
[Figure: 96-well plate layout with "No Cells", "Cells No Drug", and "Drug" wells]
Day 0: Plate cells
Day 1: Add drug
...
Day 5: Add MTS and read plate
• Cells: 1,000 to 4,000/well
• Octuplicate measurements, one per row
• 96-well plate assays are repeated at least 3 times
• Dehydrogenase enzymes found in metabolically active cells catalyze the formation of formazan product, which is measured at 490nm absorbance
Lung Cancer Cell Lines Show Different In Vitro drug sensitivity
HCC1171: IC50 = 127 μM
HCC827: IC50 = 0.04 μM
> 1000-fold difference between the two cell lines
IC50: drug concentration causing 50% growth inhibition
In vitro drug sensitivity
[Figure: IC50 on a log scale (0.01 to 1000) for Vinorelbine, Pemetrexed, Peloruside A, Paclitaxel, Irinotecan, Gemcitabine, Gefitinib, Etoposide (VP-16), Erlotinib, Docetaxel, Cisplatin, and Carboplatin]
In vitro sensitivity to 12 therapeutic drugs was determined for 45 lung cancer cell lines.
Drug coverage
Drug Selection
[Figure: random vs. optimal drug selection; H2887]
Prediction Methods
• Filtering genes
• Clustering genes
• Principal components of the cluster
• Classification/Regression tree method to predict the drug sensitivity
• Leave-one-out cross validation
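The leave-one-out step can be sketched as below. To keep the example short, a nearest-centroid classifier stands in for the tree-based predictor named on the slide, and the two-feature data are invented; both are assumptions made for illustration.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [row for row, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Hold out each sample once, train on the rest, and score the held-out call."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        # Any gene filtering or clustering would be redone here,
        # using only the training samples of this fold.
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Hypothetical cell lines described by two summary features each.
xs = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [3.0, 2.1], [3.2, 1.9], [2.9, 2.2]]
ys = ["sensitive"] * 3 + ["resistant"] * 3
loo_accuracy = leave_one_out_accuracy(xs, ys)
```

Repeating the gene filtering and clustering inside each fold matters: if genes are selected using all samples first, the held-out sample has already influenced the model and the accuracy estimate is optimistic.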
Prediction Results
             Accuracy  Sensitivity  Specificity  NPR   PPR
Cisplatin    0.84      0.86         0.84         0.50  0.97
Gefitinib    0.80      0.50         0.85         0.33  0.92
Paclitaxel   0.84      0.89         0.70         0.91  0.64
Vinorelbine  0.86      0.92         0.50         0.92  0.50
EGFR         0.93      0.63         0.94         0.92  0.71

Leave-one-out cross-validation results for supervised classification of drug sensitivity or for EGFR status using mRNA expression profiles.
NPR: negative predictive rate, PPR: positive predictive rate.
For modeling we used only the extreme cases (<0.2 and >0.8); testing used all of the cell line data. For EGFR, tumor cell lines were divided into those with EGFR TK domain mutations and those with wild type EGFR.
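All five measures in the table derive from a 2×2 confusion matrix. The sketch below shows the arithmetic, taking NPR and PPR to mean what is more often called negative and positive predictive value; that reading is an assumption, and the counts are hypothetical rather than taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Summary measures from confusion-matrix counts.

    tp/fn: truly positive samples called positive/negative.
    tn/fp: truly negative samples called negative/positive.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # fraction of true positives recovered
        "specificity": tn / (tn + fp),   # fraction of true negatives recovered
        "ppr": tp / (tp + fp),           # fraction of positive calls that are right
        "npr": tn / (tn + fn),           # fraction of negative calls that are right
    }


# Hypothetical counts for 45 cell lines.
metrics = classification_metrics(tp=8, fp=2, tn=30, fn=5)
```

Sensitivity and specificity condition on the true class, whereas PPR and NPR condition on the prediction, so the latter two depend on how common each class is among the cell lines.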
Drug Selection
[Figure: random vs. predicted vs. optimal drug selection; p=0.0008]
Ovarian Cancer Example
Example of Over-fitting and Good Fitting
[Figure panels: Over fitting; Good fitting]
An over-fitted function does not generalize well to unseen data.
Over-fitting
• The training data contains information about the regularities in the mapping from input to output. But it also contains noise.
  The target values may be unreliable.
  There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
  So it fits both kinds of regularity.
  If the model is very flexible it can model the sampling error really well. This is a disaster.
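A toy illustration of this effect, under assumed data: the true rule is "label 1 when x > 0.5", two training labels are flipped to mimic noise, and a very flexible model (1-nearest-neighbor, which memorizes every training case) is compared with a rigid one-threshold rule. Both models are stand-ins chosen for brevity.

```python
def one_nn_predict(train_x, train_y, x):
    """Flexible model: copy the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def fit_threshold(train_x, train_y):
    """Rigid model: one cut-off halfway between the two class means."""
    ones = [x for x, y in zip(train_x, train_y) if y == 1]
    zeros = [x for x, y in zip(train_x, train_y) if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2


def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


train_x = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
train_y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]  # labels at 0.25 and 0.75 are noise
test_x = [0.12, 0.22, 0.32, 0.42, 0.58, 0.68, 0.78, 0.88]
test_y = [0, 0, 0, 0, 1, 1, 1, 1]         # clean labels from the true rule

threshold = fit_threshold(train_x, train_y)


def nn_model(x):
    return one_nn_predict(train_x, train_y, x)


def rule_model(x):
    return 1 if x > threshold else 0


nn_train, nn_test = accuracy(nn_model, train_x, train_y), accuracy(nn_model, test_x, test_y)
rule_train, rule_test = accuracy(rule_model, train_x, train_y), accuracy(rule_model, test_x, test_y)
```

The flexible model scores perfectly on the training data because it has also fitted the two noisy labels, and it then does worse than the simple rule on the clean test points.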
Feature Validation
[Figure: data matrix of M samples by N genes/features, with the samples split into blocks m1, m2, and m3]
Split data into 3 sets: training, validation, and test set.
1) For each feature subset, train predictor on training data.
2) Select the feature subset which performs best on validation data.
   Repeat and average if you want to reduce variance (cross-validation).
3) Test on test data.
Feature Validation
• Divide the total dataset into three subsets:
  Training data is used for learning the parameters of the model.
  Validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best.
  Test data is used to get a final, unbiased estimate of how well the model works. We expect this estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error rate.
• Leave-one-out validation (each sample is held out once, so all data are used for training)
• Independent testing data is the best way to test the prediction model!
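The three-way split can be sketched as follows. Everything here is assumed for illustration: the data are generated by a fixed formula (feature 0 carries the class signal, features 1 to 4 do not), each "feature subset" is a single feature, and the predictor is a one-feature threshold rule.

```python
def train_threshold(xs, ys, feature):
    """Fit a one-feature rule: cut-off halfway between the two class means."""
    pos = [x[feature] for x, y in zip(xs, ys) if y == 1]
    neg = [x[feature] for x, y in zip(xs, ys) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    sign = 1 if mean_pos >= mean_neg else -1
    return (mean_pos + mean_neg) / 2, sign


def accuracy(model, xs, ys, feature):
    threshold, sign = model
    hits = sum(
        (1 if sign * (x[feature] - threshold) > 0 else 0) == y
        for x, y in zip(xs, ys)
    )
    return hits / len(xs)


# Deterministic toy data: 90 samples, 5 features.
ys = [i % 2 for i in range(90)]
xs = [
    [2.0 * y + ((i * 37) % 11 - 5) / 10]                    # feature 0: signal
    + [((i * (f + 2)) % 7 - 3) / 3 for f in range(1, 5)]    # features 1-4: no signal
    for i, y in enumerate(ys)
]

# Split the samples into training, validation, and test sets.
train_x, val_x, test_x = xs[:30], xs[30:60], xs[60:]
train_y, val_y, test_y = ys[:30], ys[30:60], ys[60:]

# 1) Train one predictor per candidate feature on the training set.
models = {f: train_threshold(train_x, train_y, f) for f in range(5)}
# 2) Select the feature that performs best on the validation set.
best = max(models, key=lambda f: accuracy(models[f], val_x, val_y, f))
# 3) Report performance once, on the untouched test set.
test_acc = accuracy(models[best], test_x, test_y, best)
```

The test set is used only in the final step, so its accuracy is not biased by the feature selection. With real expression data the signal is far weaker than in this toy case, and the selected feature's validation accuracy would typically overstate its test accuracy.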
A real analysis example 1:
Developing a prognostic signature for non-small cell lung cancer (NSCLC)
Hierarchical Clustering of the Robust Gene Set (RGS)
[Figure: hierarchical clustering showing Group 1 and Group 2]
Unsupervised clustering groups are associated with survival
[Figure legend: Group 1, Group 2]
Gene set enrichment analysis
ER-Negative signature (Nature 2002): enriched in Group 1
ER-Positive signature (Nature 2002): enriched in Group 2
Gene Set Enrichment Analysis
Enriched in group 1 (worse prognosis group)
Gene Set Enrichment Analysis
Enriched in group 2 (better survival group)
FFPE training to frozen sample testing prediction results
(442 NSCLCs from Shedden et al, Nat Med, 2008)
FFPE to frozen samples prediction results within each stage
A real analysis example 2:
Construct gene network in NSCLC
[Figure: (A) gene network; labeled genes include SARG, NKX2-1, HOP, SFTPB, MBIP, MLF1IP, TTC37, and PRC1. (B) Predict MDACC data: pv=0.00025, n=209. (C) Predict Tomida et al: pv=0.023, n=117.]
References
• Bioinformatics course in MD Anderson: http://bioinformatics.mdanderson.org/MicroarrayCourse/index.html
• Terry Speed's Class Homepages: http://www.stat.berkeley.edu/users/terry/Classes/index.html
• Iowa State bioinformatics course: http://www.public.iastate.edu/%7Ednett/microarray/microarray.shtml
• Dov Stekel, Microarray Bioinformatics
• Richard Simon, et al. Design and analysis of DNA microarray investigations
• Robert Gentleman et al. Bioinformatics and computational biology solutions using R and Bioconductor
Microarray vs. mRNA-Seq
Mortazavi et al, Nat Methods 2008
Microarray vs. mRNA-Seq
Slide from Wing Wong, Stanford
Reproducibility of RNA-Seq
Mortazavi et al, Nat Methods 2008
Microarray vs. mRNA-Seq
• Sequencing assays provide digital measures of sequence abundance, i.e., read counts. In contrast, microarrays provide analog measures of sequence abundance, i.e., fluorescence intensities.
• Microarrays depend on the design of chips
  --- Annotation problems
  --- Aligning probes across platforms
  --- Hard to deal with alternative splicing
  --- Cannot identify new transcripts
• mRNA-Seq
  --- Measure transcriptome composition
  --- Relatively easy to deal with alternative splicing
  --- Discover new exons or genes