Probe analysis and data preprocessing
1. Affymetrix Probe level analysis
1) Normalization
Constant, Loess, Rank invariant, Quantile normalization
2) Expression measure
MAS 4.0, LI-Wong (dChip), MAS 5.0, RMA
3) Background adjustment
PM-MM, PM only, RMA, GC-RMA
2. Statistical analysis of cDNA array
1) Image analysis
2) Normalization
3) Assess expression level
(A case study with Bayesian hierarchical model)
4) Experimental design
Source of variations; Calibration and replicate; Choice of reference
sample; Design of two-color array
3. Preprocessing
1) Data transformation
2) Filtering (in all platforms)
3) Missing value imputation (in all platforms)
From experiment to down-stream analysis
Statistical Issues in Microarray Analysis
Workflow: experimental design → image analysis → preprocessing
(normalization, filtering, MV imputation) → down-stream analyses:
identification of differentially expressed genes, clustering,
classification, pathway analysis, regulatory network inference,
data visualization, and integrative analysis & meta-analysis.
Data Preprocessing
Preliminary analyses extract and summarize information
from the microarray experiments.
• These steps do not themselves yield biological discoveries,
• but they prepare the data for meaningful down-stream analyses with
targeted biological purposes (e.g. DE gene detection, classification,
pathway analysis, ...)
From scanned images
• Image analysis (extract intensity values from the images)
• Probe analysis (generate the data matrix of expression profiles)
• Preprocessing (data transformation, gene filtering and
missing value imputation)
1. Affymetrix probe level analysis
Overview of the technology
[Figure: hybridization, from Affymetrix Inc.]

Array Design
• 25-mer unique oligos; the mismatch probe differs at the middle nucleotide
• multiple probes (11~16) for each gene
[Figure from Affymetrix Inc.]
Array Probe Level Analysis
Pipeline: background adjustment → normalization → summarization
(some methods instead apply normalization before background adjustment).
• Give an expression measure for each probe set on each array
(how should the information from its 16 probes be pooled?)
• The result will greatly affect subsequent analyses (e.g. clustering
and classification). If not modeled properly:
=> "Garbage in, garbage out"

We will leave the discussion of "background adjustment" to the last
because there are more new, exciting technical advances there.
1.1 Normalization
The need for normalization:
          array1    array2
gene1     3308      4947.5
gene2     2334      3155.5
gene3     2518      3738
gene4     8882.5    18937
gene5     5041      12956.5
gene6     7314.5    19013.5
gene7     3508.5    8164
gene8     2183      5121.5
gene9     4790      8082
gene10    1645.5    1794.5
gene11    1772      1963
gene12    1802.5    2186.5
gene13    14846     35811
gene14    9986      25293
gene15    11640.5   21508
gene16    3860      6530
average   5339.5    11200.09

Intensities on array 2 are intrinsically larger than on array 1
(about two-fold).
1.1. Normalization
Reason:
1. Different labeling efficiency.
2. Different hybridization time or
hybridization condition.
3. Different scanning sensitivity.
4. …..
θ_gs : the underlying expression level
x_gs : the observed intensity

  x_g1 = f_1(θ_g1)   (array 1)
  x_g2 = f_2(θ_g2)   (array 2)
    ...
  x_gS = f_S(θ_gS)   (array S)
Normalization is needed in any microarray platform.
(including Affy & cDNA)
1.1. Normalization
Constant scaling
• Distributions on each array are scaled to have an identical mean.
Suppose array 1 is the reference array:

  x'_gs = (x̄_.1 / x̄_.s) · x_gs

• Applied in MAS 4.0 and MAS 5.0, but they perform the scaling after
computing the expression measure.
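The scaling above can be sketched in a few lines of numpy; the toy matrix below is a made-up example, not data from the slides.

```python
import numpy as np

def constant_scale(X, ref=0):
    """Scale each array (column) so its mean matches the reference array.

    X: genes x arrays intensity matrix; ref: index of the reference array.
    Implements x'_gs = (mean of reference array / mean of array s) * x_gs.
    """
    X = np.asarray(X, dtype=float)
    means = X.mean(axis=0)              # per-array means
    return X * (means[ref] / means)     # broadcast the scale over columns

# Toy example: array 2 is roughly twice as bright as array 1.
X = np.array([[100.0, 210.0],
              [200.0, 390.0],
              [300.0, 600.0]])
Xn = constant_scale(X, ref=0)
# After scaling, every column has the same mean as the reference column.
```
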
1.1. Normalization
Constant scaling: Underlying reasoning
Assumption:  x_g1 = f_1(θ_g1) = ν_1 θ_g1   (array 1)
             x_g2 = f_2(θ_g2) = ν_2 θ_g2   (array 2)
               ...
             x_gS = f_S(θ_gS) = ν_S θ_gS   (array S)

Suppose array 1 is the reference array:

  x'_gs = (x̄_.1 / x̄_.s) · x_gs ≈ ν_1 θ_gs,   s = 2, ..., S

ν_1 is not estimable, but ν_1/ν_s can be estimated as x̄_.1 / x̄_.s
if the overall expression levels are roughly the same (θ̄_.1 ≈ θ̄_.s).
1.1 Normalization
M-A plot (each point is a gene; the M = 0 line is the reference):

  M = log(x_gs / x_g1)
  A = log(x_gs · x_g1) / 2
1.1 Normalization
For non-differentially expressed genes, θ_gs = θ_g1, so

  M = log(x_gs / x_g1) = log(ν_s θ_gs / (ν_1 θ_g1)) = log(ν_s / ν_1) = constant
M-A plot shows the need for non-linear normalization.
The normalization factor is a function of the expression
level.
1.1 Normalization
Fit M̂ = f̂(A) by the 'lowess' function in S-Plus.

Replicate arrays: the same pool of sample is applied.

Normalized log-ratio:  M̃ = M - M̂
1.1 Normalization
Non-linear scaling: Underlying reasoning
Assumption:  x_g1 = f_1(θ_g1) = ν_1(θ_g1) θ_g1   (array 1)
             x_g2 = f_2(θ_g2) = ν_2(θ_g2) θ_g2   (array 2)
               ...
             x_gS = f_S(θ_gS) = ν_S(θ_gS) θ_gS   (array S)

Log relative expression level:

  h_gs = log(θ_gs / θ_g1)
       = log[ x_gs ν_1(θ_g1) / (x_g1 ν_s(θ_gs)) ]
       = log(x_gs / x_g1) + g(θ_g1, θ_gs)

where g(θ_g1, θ_gs) = log[ ν_1(θ_g1) / ν_s(θ_gs) ].
1.1 Normalization
Non-linear scaling: Underlying reasoning (cont’d)
Suppose we know the green genes are non-differentially expressed
genes (θ_gs = θ_g1). For those genes,

  g(θ_g1, θ_gs) = log[ ν_1(θ_g1) / ν_s(θ_gs) ] = log[ x_g1(θ_g1) / x_gs(θ_gs) ]

so the curve can be estimated at each average intensity A:

  ĝ(θ_g1, θ_gs) = log[ x_g1(A) / x_gs(A) ]

  ĥ_gs = log(θ_gs / θ_g1) = log(x_gs / x_g1) + ĝ(θ_g1, θ_gs)
The problem is: we usually don’t know which
genes are constantly expressed!!
1.1. Normalization
Loess (Yang et al., 2002)
• Uses all genes to fit a non-linear normalization curve on the M-A
plot scale (believing that most genes are constantly expressed).
• Performs normalization between arrays pairwise.
• Has been extended to perform normalization globally without selecting
a baseline array, but that extension is time-consuming.
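The intensity-dependent normalization above can be sketched as follows. A simple running-mean smoother stands in for lowess here to keep the sketch dependency-free; in practice one would fit M̂ = f̂(A) with a real lowess implementation (e.g. `statsmodels.nonparametric.lowess`). The toy data are synthetic.

```python
import numpy as np

def ma_normalize(x_ref, x_trt, window=101):
    """Intensity-dependent (lowess-style) normalization of one array
    against a reference array, on the M-A scale.

    Fits a smooth curve Mhat as a function of A (here: a running mean
    of M over genes sorted by A) and returns Mtilde = M - Mhat.
    """
    M = np.log2(x_trt / x_ref)
    A = 0.5 * (np.log2(x_trt) + np.log2(x_ref))
    order = np.argsort(A)               # genes sorted by average intensity
    Msorted = M[order]
    Mhat = np.empty_like(M)
    half = window // 2
    for i in range(len(M)):             # smooth M along increasing A
        lo, hi = max(0, i - half), min(len(M), i + half + 1)
        Mhat[order[i]] = Msorted[lo:hi].mean()
    return M - Mhat

# Toy example: the treatment array is a constant 2-fold brighter,
# so all the log-ratios should be removed by normalization.
rng = np.random.default_rng(0)
ref = rng.uniform(100, 10000, size=2000)
trt = 2.0 * ref
Mtilde = ma_normalize(ref, trt)
```
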
1.1. Normalization
Invariant set (dChip)
• Select a baseline array (the default is the one with median average
intensity).
• For each "treatment" array, identify a set of genes whose ranks are
conserved between the baseline and treatment array. This set of
rank-invariant genes is considered non-differentially expressed.
• Each array is normalized against the baseline array by fitting a
non-linear normalization curve through the invariant gene set:

  G = { g : |rank(x_gs) - rank(x_g1)| < d  and  l < rank[(x_gs + x_g1)/2] < G - l }

Tseng et al., 2001
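A minimal sketch of rank-invariant gene selection. The thresholds `d` and `l` below are illustrative choices, not the dChip defaults.

```python
import numpy as np

def invariant_set(x_base, x_trt, d=50, l=25):
    """Select rank-invariant genes between a baseline and treatment array.

    A gene is kept when its ranks on the two arrays differ by less than
    d, and its average-intensity rank avoids the l most extreme
    positions at either end.  Returns a boolean mask over genes.
    """
    G = len(x_base)
    r_base = np.argsort(np.argsort(x_base))   # 0-based ranks
    r_trt = np.argsort(np.argsort(x_trt))
    r_avg = np.argsort(np.argsort((x_base + x_trt) / 2.0))
    return (np.abs(r_base - r_trt) < d) & (r_avg >= l) & (r_avg < G - l)

# Toy example: a pure scaling preserves every rank, so the mask keeps
# all genes except the 2*l trimmed at the intensity extremes.
rng = np.random.default_rng(1)
base = rng.uniform(100, 10000, size=1000)
trt = 1.5 * base
mask = invariant_set(base, trt)
```
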
1.1. Normalization
Invariant set (dChip)
Advantage:
More robust than fitting with all genes as in loess, especially when
the expression distributions of the arrays are very different.
Disadvantage:
The selection of the baseline array is important.
1.1. Normalization
Quantile normalization (RMA) (Irizarry2003)
1. Given n arrays of length p, form X of dimension p×n where each
array is a column.
2. Sort each column of X to give X_sort.
3. Take the means across rows of X_sort and assign this mean to each
element in the row to get X'_sort.
4. Get X_normalized by rearranging each column of X'_sort to have the
same ordering as the original X.

     X           X_sort        X'_sort           X_normalized
  237  283      237  198     217.5  217.5      217.5  306
  341  397      329  283     306    306        338    399
  401  198      341  335     338    338        399    217.5
  329  335      401  397     399    399        306    338
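The four steps can be implemented directly in numpy; the code below reproduces the 4×2 worked example (ties are not handled specially in this sketch).

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: sort each column, average across rows of
    the sorted matrix, then put the row means back in each column's
    original order."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)              # sort permutation per column
    row_means = np.sort(X, axis=0).mean(axis=1)
    Xn = np.empty_like(X)
    for j in range(X.shape[1]):
        Xn[order[:, j], j] = row_means         # undo the sort per column
    return Xn

# The 4x2 worked example from the slide.
X = np.array([[237.0, 283.0],
              [341.0, 397.0],
              [401.0, 198.0],
              [329.0, 335.0]])
Xn = quantile_normalize(X)
```

After normalization each column holds exactly the same set of values (217.5, 306, 338, 399), arranged in each column's original order.
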
1.1. Normalization
Bolstad, B.M., Irizarry RA, Astrand, M, and Speed, TP
(2003), A Comparison of Normalization Methods for High
Density Oligonucleotide Array Data Based on Bias and
Variance Bioinformatics. 19(2):185-193
A careful comparison of different normalization methods, concluding
that quantile normalization generally performs the best.
1.2. Summarize Expression Index
There’re multiple probes for one gene (11 PM and 11 MM) in U133.
How do we summarize the 24 intensity values to a meaningful
expression intensity for the target gene?
23
1.2. Summarize Expression Index
• MAS 4.0
For each probe set (I: # of arrays, J: # of probes):

  PM_ij - MM_ij = θ_i + ε_ij,   i = 1,...,I,  j = 1,...,J

θ_i is estimated by the average difference.
1. Negative expression values can occur.
2. Noisy for low-expressed genes.
3. Does not account for probe affinity.
1.2. Summarize Expression Index
• dChip (DNA chips)
For each probe set (I: # of arrays, J: # of probes):

  PM_ij = ν_j + θ_i α_j + θ_i φ_j + ε_ij
  MM_ij = ν_j + θ_i α_j + ε_ij
  PM_ij - MM_ij = θ_i φ_j + ε_ij,   i = 1,...,I,  j = 1,...,J
  Σ_j φ_j² = J,   ε_ij ~ N(0, σ²)

1. Accounts for the probe affinity effect, φ_j.
2. Outlier detection through multi-chip analysis.
3. Recommended for more than 10 arrays.
Li and Wong (PNAS, 2001)
Multiplicative model: PM_ij - MM_ij = θ_i φ_j + ε_ij (better)
Additive model: PM_ij - MM_ij = θ_i + φ_j + ε_ij
1.2. Summarize Expression Index
• MAS 5.0
For each probe set (I: # of arrays, J: # of probes):

  log(PM_ij - CT_ij) = log(θ_i) + ε_ij,   i = 1,...,I,  j = 1,...,J

  CT_ij = MM_ij if MM_ij < PM_ij; a value less than PM_ij if MM_ij ≥ PM_ij.

θ_i is estimated by a robust average (Tukey biweight).
1. No more negative expression values.
2. Taking the log adjusts for the dependence of the variance on the mean.
1.2. Summarize Expression Index
• RMA (Robust Multi-array Analysis)
For each probe set (I: # of arrays, J: # of probes):

  log(T(PM_ij)) = θ_i + φ_j + ε_ij,   i = 1,...,I,  j = 1,...,J

T is the transformation for background correction and normalization;
ε_ij ~ N(0, σ²).
1. Log-scale additive model.
2. Suggests not using MM.
3. Fits the linear model robustly (median polish).
Irizarry et al. (NAR, 2003)
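The robust fit in the last step is Tukey's median polish, which alternately sweeps out row and column medians. A minimal sketch (on a noise-free toy probe set, not real chip data):

```python
import numpy as np

def median_polish(Y, n_iter=10):
    """Tukey's median polish on a matrix Y (arrays x probes), fitting
    Y_ij ~ overall + row_i + col_j robustly.
    Returns (overall, row effects, column effects, residuals)."""
    Y = np.asarray(Y, dtype=float).copy()
    overall = 0.0
    row = np.zeros(Y.shape[0])
    col = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        rmed = np.median(Y, axis=1)        # sweep out row medians
        Y -= rmed[:, None]
        row += rmed
        m = np.median(row)                 # keep row effects centered
        overall += m
        row -= m
        cmed = np.median(Y, axis=0)        # sweep out column medians
        Y -= cmed[None, :]
        col += cmed
        m = np.median(col)
        overall += m
        col -= m
    return overall, row, col, Y

# Toy probe set: 3 arrays x 4 probes, exactly additive with no noise.
theta = np.array([8.0, 9.0, 10.0])         # array (expression) effects
phi = np.array([-0.5, 0.0, 0.2, 0.3])      # probe affinity effects
Y = theta[:, None] + phi[None, :]
mu, t_hat, p_hat, resid = median_polish(Y)
# mu + t_hat recovers each array's expression up to a constant shift.
```
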
[Figure: Affymetrix Latin square data (20 vs. 1.25 spike-in levels);
three panels with R² = 0.85, 0.95 and 0.97.
From Irizarry et al. (NAR, 2003)]
[Figure: Affymetrix Latin square data, from Irizarry et al. (NAR, 2003)]
1.3. Background Adjustment
• Direct subtraction: PM - MM (MAS 4.0, dChip, MAS 5.0)
Assume the following deterministic model:

  PM = O + N + S   (O: optical noise, N: non-specific binding, S: signal)
  MM = O + N
  => PM - MM = S > 0

Is it true?
MM does not measure the background noise of PM
• Yeast sample hybridized to a human chip: if MM measured the
non-specific binding of PM well, we would see PM ≈ MM; but R² is only
0.5, and many MM > PM.
• 86 HG-U95A human chips (human blood extracts): a two-fork phenomenon
at high abundance; 1/3 of probes have MM > PM.
1.3. Background Adjustment
Reasons MM should not be used:
1. MM contains non-specific binding information but also includes
signal information and noise.
2. The non-specific binding mechanism is not well-studied.
3. MM is costly (it takes up half the space of the array).

• Ignore MM
dChip has an option for a PM-only model. In general, PM-only is
preferred for both the dChip and RMA methods.
1.3. Background Adjustment
Consider sequence information (Naef & Magnasco, 2003):
1. 95% of probes with MM > PM have a purine (A, G) in the middle base.
2. In the current protocol, only pyrimidines (C, T) carry the
biotin-labeled fluorescence.
1.3. Background Adjustment
Fit a simple linear model (α: probe affinity):
1. C > G ≈ T > A
2. Boundary effect
Naef & Magnasco, 2003
1.3. Background Adjustment
Some chemical explanation of the result (see next page):

  PM middle base             C        G        T        A
  MM middle base             G        C        A        T
  Labeling                   Yes (+)  No (-)   Yes (+)  No (-)
  Labeling impedes binding   Yes (-)  No       Yes (-)  No
  Hydrogen bonds             3 (+)    3 (+)    2        2
  Sequence-specific
  brightness                 High     average  average  Low
1.3. Background Adjustment
Double strand
Remember from the first lecture:
• G-C has three hydrogen
bonds. (stronger)
• A-T has two hydrogen
bonds. (weaker)
From: Lodish et al. Fig 4-4
1.3. Background Adjustment
• GC-RMA (Wu et al., 2004 JASA)

  PM = O + N_PM + S,   MM = O + N_MM

O: optical noise, log-normally distributed.
N: non-specific binding, with

  ( log(N_PM) )        ( ( μ_PM )      ( 1  ρ ) )
  ( log(N_MM) )  ~  N₂ ( ( μ_MM ) , σ² ( ρ  1 ) )

  μ_PM = h(α_PM),   μ_MM = h(α_MM)

h: a smooth (almost linear) function.
α: the sequence-information weight computed from the simple linear model.
Criteria to compare different methods
• Accuracy
– In well-controlled experiments with spike-in genes (such as the
Latin square data), the accuracy of the estimated log-fold changes
relative to the underlying true log-fold changes is assessed.
(Only available in data with spike-in genes.)
• Precision
– In data with replicates, the reproducibility (SD) of the same gene
across replicates is assessed. (Available in data with replicates.)
[Figures: accuracy and precision comparisons, including GC-RMA]
  Software      Fee         GUI   Flexibility for      Expression measure   Audience
                                  programming/mining
  MAS 4.0       Commercial  Yes   No                   Average difference   Manufacturer default
  dChip         Free        Yes   Some extra tools     Li-Wong model        Biologists
  MAS 5.0       Commercial  Yes   No                   Robust average of    Manufacturer default
                                                       log differences
  RMAExpress    Free        Yes   No                   RMA                  Biologists
  Bioconductor  Free        Some  Best                 All of the above     Statisticians, programmers
  ArrayAssist   Commercial  Yes   No                   RMA, GC-RMA          Biologists
Probe level analysis in Bioconductor
(affy package)
  Background methods:     none, rma/rma2, mas
  Normalization methods:  quantiles, loess, contrasts, constant,
                          invariantset, qspline
  PM correction methods:  mas, pmonly, subtractmm
  Summarization methods:  avgdiff, liwong, mas, medianpolish, playerout
A Simple Case Study
Latin Square Data
59 HG-U95A arrays
14 spike-in genes in 14 experimental groups
Experiment groups: A-L, plus two replicate groups M/N/O/P and Q/R/S/T.
Transcript groups: 1-14. The spike-in concentrations (pM) cycle through

  0, 0.25, 0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024

in a 14×14 cyclic Latin square: each transcript group's concentration
series across the 14 experiment groups is a one-position shift of the
previous group's series.
M, N, O, P are one set of replicates and Q, R, S, T are another.
http://www.affymetrix.com/analysis/download_center2.affx
A Simple Case Study
• Take the following two replicate groups:

  M: 1521m99hpp_av06.CEL    Q: 1521q99hpp_av06.CEL
  N: 1521n99hpp_av06.CEL    R: 1521r99hpp_av06.CEL
  O: 1521o99hpp_av06.CEL    S: 1521s99hpp_av06.CEL
  P: 1521p99hpp_av06.CEL    T: 1521t99hpp_av06.CEL

• Use Bioconductor to perform a simple evaluation of the different
probe analysis algorithms.
Note: This is only a simple demonstration. The
evaluation result in this presentation is not
conclusive.
A Simple Case Study
[Figures: average log intensity vs. SD of log intensities across
replicates (M, N, O, P) for MAS 5.0, dChip (PM/MM), dChip (PM only),
RMA, GC-RMA (PM only) and GC-RMA (PM/MM)]
A Simple Case Study
[Figures: average log intensity vs. SD of log intensities across
replicates (Q, R, S, T) for the same six methods]
A Simple Case Study
Average pair-wise correlations between replicates:

                      M, N, O, P    Q, R, S, T
  MAS5                0.8930        0.9002
  dChip (PM/MM)       0.9604        0.9621
  dChip (PM-only)     0.9940        0.9966
  RMA                 0.9978        0.9978
  GC-RMA (PM/MM)      0.9988        0.9990
  GC-RMA (PM-only)    0.9993        0.9994

Replicate correlation performance:
GC-RMA (PM-only) > GC-RMA (PM/MM) > RMA > dChip (PM-only)
>> dChip (PM/MM) >> MAS5
A Simple Case Study
• RMA greatly improves on dChip (PM/MM), but the dChip (PM-only) model
generally seems a little better than RMA.
• The average replicate correlations of RMA (0.9978) are a little
better than those of dChip (PM-only) (0.9940 & 0.9966).
• dChip (PM-only) suffers from a number of outlying genes in the model.

Outlying genes that do not fit the Li-Wong model
Conclusion:
1. Technological advances have been made toward smaller probe sizes
and better sequence selection algorithms that reduce the number of
probes per probe set. This enables more biologically meaningful genes
on a slide and reduces the cost.
2. Recent analysis advances have focused on understanding and
modelling hybridization mechanisms. This will allow better use of the
MM probes, or eventually suggest removing MMs from the array.
3. The probe analysis is relatively settled in the field. In the
second lab session next Friday, we will introduce dChip
and RMAexpress for Affymetrix probe analysis.
2. cDNA probe level analysis
cDNA Microarray Review
From Y. Chen et al. (1997)
cDNA Microarray Review
Probe (array) printing
1. 48 grids in a 12x4
pattern.
2. Each grid has 12x16
features.
3. Total 9216 features.
4. Each pin prints 3
grids.
cDNA Microarray Review
Probe design and printing
cDNA Microarray Review
From Y. Chen et al. (1997)
cDNA Microarray Review
Comparison of cDNA array and GeneChip:

                    cDNA                            GeneChip
  Probe             Probes are cDNA fragments,      Probes are short oligos
  preparation       usually amplified by PCR        synthesized using a
                    and spotted by robot.           photolithographic approach.
  Colors            Two-color (measures             One-color (measures
                    relative intensity)             absolute intensity)
  Gene              One probe per gene              11-16 probe pairs per gene
  representation
  Probe length      Long, varying lengths           25-mers
                    (hundreds to 1K bp)
  Density           Maximum of ~15,000 probes       38,500 genes × 11 probes
                                                    = 423,500 probes
Advantage and disadvantage of
cDNA array and GeneChip
  cDNA microarray                          Affymetrix GeneChip
  The data can be noisy and of             Specific and sensitive; results
  variable quality.                        very reproducible.
  Cross (non-specific) hybridization       Hybridization more specific.
  can often happen.
  May need an RNA amplification            Can use a small amount of RNA.
  procedure.
  More difficulty in image analysis.       Image analysis and intensity
                                           extraction are easier.
  Need to search databases for gene        More widely used; better quality
  annotation.                              of gene annotation.
  Cheap (both initial and per-slide        Expensive (~$400 per array +
  cost).                                   labeling and hybridization).
  Can be custom made for special           Only several popular species
  species.                                 are available.
  Do not need to know the exact DNA        Need the DNA sequence for probe
  sequence.                                selection.
2.1. Image Analysis
• Identify spot area:
1. Each spot contains around 100×100 pixels.
2. The spot image may not be uniformly and roundly distributed.
3. Some software (like ScanAlyze or ImaGene) has algorithms to "help"
place the grids and identify spot and background areas locally.
4. Still semi-automatic: a very time-consuming job.
• Extract intensities (data reduction):
1. Aim to extract the minimum most informative statistics for further
analysis. Usually use the median signal minus the median background.
2. Some spot quality indexes (e.g. stdev or CV) will also be computed.
2.1. Image Analysis
[Screenshot: ScanAlyze]
2.1. Image Analysis
1. Input the number of rows and columns in each
sector; input the approximate location and
distances between spots.
2. May need to tilt the grids
3. Some local adjustments may be needed.
4. Once the spot grids are close enough to the real
spot physical location, computer image algorithms
will help to find the optimal spot area (spherical or
irregular shapes) and background area.
May take 10~30 minutes for an array. Usually the
biologists will do it.
2.1. Image Analysis
http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/
2.1. Image Analysis
Result file from image analysis
Summarized intensities for further analysis:
median(spot intensities) - median(background intensities)
2.2. Normalization
• Affymetrix
– Normalization done across arrays
– After normalization, the expression data matrix
shows absolute expression intensities.
• cDNA
– Normalization between the two colors within an array.
– After normalization, the expression data matrix shows comparative
expression intensities (log-ratios).
2.2. Normalization
• The same sample on both dyes.
• Each point is a gene.
• Orange is one array and purple is another array.

Calibration: apply the same samples to both dyes (E. coli grown in
glucose). Purple and orange represent two replicate slides.

  A = [log(Cy5) + log(Cy3)] / 2
  M = log(Cy5 / Cy3)
2.2. Normalization
Normalization, general idea:
• Dye effect: Cy5 is usually more bleached than Cy3.
• Slide effect.
• The normalization factor is slide dependent.
• Usually need to assume that most genes are not differentially
expressed, or that up- and down-regulated genes roughly cancel out
the expression effect.
2.2. Normalization
Normalization:
Current popular methods:
• House-keeping genes: select a set of non-differentially expressed
genes according to experience, then use these genes to normalize.
• Constant normalization factor:
  – Use the mean or median of each dye to normalize.
  – ANOVA model (Churchill's group)
• Average-intensity-dependent normalization:
  – Robust nonlinear regression (lowess) applied to the whole genome.
    (Speed's group)
  – Select invariant genes computationally (rank-invariant method),
    then apply lowess. (Wong's group)
2.2. Normalization
Loess Normalization:
Pin-wise normalization using all the genes. It requires the assumption
that up- and down-regulated genes with similar average intensities
(denoted A) roughly cancel out, or that most genes otherwise remain
unchanged.

[M-A plot with pin-wise lowess curves]
From Dudoit et al. 2000
2.2. Normalization
Rank Invariant Normalization:
Rank-invariant method (Schadt et al. 2001, Tseng et al. 2001).
A simple version:

  G = { g : |rank(Cy3_g) - rank(Cy5_g)| < 5 }

Iterative selection:

  S_0 = { g : |Rank(Cy5_g) - Rank(Cy3_g)| < p·G
              and  l < Rank[(Cy5_g + Cy3_g)/2] < G - l }
  S_i = { g : g ∈ S_{i-1}
              and  |Rank_{S_{i-1}}(Cy5_g) - Rank_{S_{i-1}}(Cy3_g)| < p·|S_{i-1}| }

Idea:
• If a particular gene is up- or down-regulated, its Cy5 rank in the
whole genome will differ significantly from its Cy3 rank.
• Iterative selection helps to select a more conserved invariant set
when the number of genes is large.
2.2. Normalization
Rank Invariant Normalization:
Data: E. Coli. Chip, ~4000 genes, from Liao lab.
Blue points are invariant genes selected by the rank-invariant method.
Red curves are estimated by lowess and extrapolation.
Data Truncation
• In cDNA microarrays, the intensity value lies between 0 and
2^16 = 65536.
• Measurements of low-intensity genes are not stable.
• Extremely highly expressed genes can saturate.
• For example, we can truncate genes with intensity smaller than 200
or larger than 65000.
2.3. Assess Expression Level
Approaches to assess expression level:
Single slide:
1. Normal model (Chen et al. 1997)
2. Gamma model with an empirical Bayes approach (Newton et al. 2001)
With replicate slides:
• Traditional t-test
• ANOVA model (Kerr et al. 2000)
• Permutation t-test (Dudoit et al. 2000)
Hierarchical structure:
• Linear hierarchical model (Tseng et al. 2001)
2.3. Assess Expression Level
Single slide analysis:
1. Chen et al. (1997)

  Cy5 ~ N(μ_R, σ_R²),  Cy3 ~ N(μ_G, σ_G²),
  with a constant coefficient of variation c = σ_R/μ_R = σ_G/μ_G.

Infer μ_R/μ_G from Cy5/Cy3 at known house-keeping genes in the
experiment.

2. Newton et al. (1999)
Assume gamma distributions to avoid negative values. Suggest inferring
the ratio conditional on the absolute intensities.
2.3. Assess Expression Level
Case study (Tseng et al. 2001):

125-gene project: each gene is spotted four times.
  Calibration: E. coli grown in acetate vs. acetate (C1S1~2);
  E. coli grown in glucose vs. glucose (C2S1~4, C3S1~2, C4S1~3).
  Comparative: E. coli grown in acetate vs. glucose (R1S1~2, R2S1~2).

4129-gene project: each gene is singly spotted.
  Calibration: E. coli grown in acetate vs. acetate (C1S1~2, C2S1~2).
  Comparative: E. coli grown in acetate vs. glucose (R1S1~2, R2S1~2).
2.3. Assess Expression Level
Hierarchical structure in the experiment design:

  Original mRNA pool
    → (reverse transcription & labeling) → C1, C2
    → (hybridize onto different slides) → C1S1, C1S2, C2S1, C2S2
2.3. Assess Expression Level
2.3. Assess Expression Level
Basics of Bayesian Analysis
  x_1, ..., x_n ~ f(x | θ)

θ: the unknown underlying parameter of interest
x_1, ..., x_n: observed data

Bayes rule:

  g(θ | x_1, ..., x_n) = f(x_1, ..., x_n | θ) h(θ)
                         / ∫ f(x_1, ..., x_n | θ) h(θ) dθ

h(θ) is a prior distribution (knowledge) of θ.
g(θ | x_1, ..., x_n) is called the posterior distribution of θ,
meaning how much we can say about θ given the data.
2.3. Assess Expression Level
Bayesian Hierarchical Model

  x_seg ~ N(θ_eg, σ_g²),    σ_g² ~ k σ̃_g² χ_k⁻²
  θ_eg ~ N(μ_g, τ_g²),      τ_g² ~ h τ̃_g² χ_h⁻²
  p(μ_g) ∝ 1

x: normalized log-ratios, log(Cy5/Cy3); e: experiment, s: slide, g: gene.
μ: underlying true expression level.
τ²: experimental (cultural) variation.
σ²: slide variation. h and k are adjustable parameters.

  Original mRNA pool (μ) -τ²-> C1, C2 -σ²-> C1S1, C1S2, C2S1, C2S2

Note only the x's are observed, e.g. ((0.75, 0.67), (0.45, 0.51)).
2.3. Assess Expression Level
~
g
g
2
 eg
2
g
h
~
2
g
g
2
k
x seg
xseg ~ N ( eg , g )
2
 g 2 ~ k~g 2  k 2
eg ~ N ( g ,  g 2 )
 g 2 ~ h~g 2  h 2
p  g   1
78
2.3. Assess Expression Level
How to specify the prior?
Empirical Bayes: use the empirical data to help specify the
hyperparameters (σ̃_g², τ̃_g²). A common version of EB is achieved by
integrating out the intermediate parameters and maximizing the
resulting marginal likelihood:

  max_{σ̃², τ̃²} p(σ̃², τ̃² | X)
    = max_{σ̃², τ̃²} ∫ p(σ̃², τ̃², μ, θ, σ², τ² | X) dμ dθ dσ² dτ²

This is hard to implement in this three-layer hierarchical model.
2.3. Assess Expression Level
Another version of EB: estimate the parameters in the prior from the
empirical data.

  σ̃² = Σ_{g,s,e} (y_gse - ȳ_g·e)² / [G(S-1)E]   (between-slide variation)
  τ̃² = Σ_{g,e} (ȳ_g·e - ȳ_g··)² / [G(E-1)]      (between-experiment variation)

Note:
• Since there are thousands of genes, the common EB problem of reusing
the data and getting an over-confident prior is alleviated.
• The estimation of τ̃² is more conservative.

  Original mRNA pool -τ²-> C1, C2 -σ²-> C1S1, C1S2, C2S1, C2S2
2.3. Assess Expression Level
Getting the prior distribution (when we have calibration experiments):

Calibration (normal vs normal) and comparative (cancer vs normal)
experiments share the same hierarchical structure (original mRNA pool
→ C1, C2 → slides). Gene-specific variance estimates are shrunk toward
array-wide estimates:

  σ̃_g² = [(S-1)·E·σ̂_g² + σ̂_A²] / [(S-1)·E + 1]
  τ̃_g² = [E·τ̂_g² + τ̂_A²] / [E + 1]

where

  σ̂_g² = Σ_{s,e} (x_gse - x̄_g·e)² / [(S-1)E]
  τ̂_g² = Σ_e x̄_g·e² / E
  σ̂_A² = Σ_{g,s,e} (x_gse - x̄_g·e)² / [G(S-1)E]
  τ̂_A² = Σ_{g,e} x̄_g·e² / (GE)

e.g. ((0.75, 0.67), (0.45, 0.51))
2.3. Assess Expression Level
MCMC (Gibbs sampling) for the hierarchical model:

1. Initialize (θ_eg)⁽⁰⁾ = x̄_·eg.
2. Draw τ_g² | θ_eg, μ_g ~ [Σ_e (θ_eg - μ_g)² + h τ̃_g²] χ⁻²_{E+h}.
3. Draw μ_g | θ_eg, τ_g² ~ N(θ̄_·g, τ_g² / E).
4. Draw σ_g² | θ_eg, x_seg ~ [Σ_{s,e} (x_seg - θ_eg)² + k σ̃_g²] χ⁻²_{SE+k}.
5. Draw θ_eg | x_seg, μ_g, σ_g², τ_g² ~
     N( (τ_g² Σ_s x_seg + σ_g² μ_g) / (S τ_g² + σ_g²),
        σ_g² τ_g² / (S τ_g² + σ_g²) ).

Iterate steps 2-5 until convergence.
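The Gibbs steps above can be sketched numerically for a single gene. The degrees of freedom in the variance updates and the default hyperparameter values below are illustrative simplifications, not the paper's exact formulas; the toy data are the ((0.75, 0.67), (0.45, 0.51)) example from the model slide.

```python
import numpy as np

def gibbs_one_gene(x, sig2_tilde=0.01, tau2_tilde=0.01, h=5.0, k=5.0,
                   n_iter=2000, burn=500, seed=0):
    """Gibbs sampler for one gene under the hierarchical model:
    x_se ~ N(theta_e, sigma2), theta_e ~ N(mu, tau2), flat prior on mu,
    with scaled inverse-chi-square priors on sigma2 and tau2.
    x: S x E matrix (slides x experiments) of normalized log-ratios.
    Returns post-burn-in posterior draws of mu."""
    rng = np.random.default_rng(seed)
    S, E = x.shape
    theta = x.mean(axis=0)               # step 1: start at slide means
    tau2, sig2 = tau2_tilde, sig2_tilde
    mu_draws = []
    for _ in range(n_iter):
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / E))
        tau2 = (np.sum((theta - mu) ** 2) + h * tau2_tilde) \
            / rng.chisquare(E + h)
        sig2 = (np.sum((x - theta) ** 2) + k * sig2_tilde) \
            / rng.chisquare(S * E + k)
        post_var = sig2 * tau2 / (S * tau2 + sig2)
        post_mean = (tau2 * x.sum(axis=0) + sig2 * mu) / (S * tau2 + sig2)
        theta = rng.normal(post_mean, np.sqrt(post_var))
        mu_draws.append(mu)
    return np.array(mu_draws[burn:])

# One gene, E = 2 experiments x S = 2 slides.
x = np.array([[0.75, 0.45],
              [0.67, 0.51]])
mu_post = gibbs_one_gene(x)
# The posterior for mu concentrates near the grand mean of the data.
```
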
2.3. Assess Expression Level
95% probability
interval of the
posterior distribution
of the underlying
expression level.
2.4. Experimental design
• Biological variation
Technical variations:
• Amplification
• Labeling
• Hybridization
• Pin effect
• Scanning
2.4.1 Calibration and replicate
(i) Calibration:
• Use the same sample on both dyes for hybridization.
• Calibration experiments help to validate experiment quality and
gene-specific variability.

  Comparative: Tumor vs Ref
  Calibration: Ref vs Ref
2.4.2. Calibration and replicate
(ii) Replicates (replicate spots, slides):
• Multiple-spotting helps to identify locally contaminated spots but
reduces the number of genes in the study.
• Multi-stage strategy: use single-spotting to include as many genes
as possible in a pilot study; identify a subset of interesting genes
and then use multiple-spotting.
• Replicate spots and slides help to verify reproducibility at the
spot and slide level.
2.4.2. Calibration and replicate
Biological replicate
From Yang, UCSF
2.4.2. Calibration and replicate
Technical replicate
From Yang, UCSF
2.4.2. Calibration and replicate
(iii) Reverse labelling:
Sample A
Sample B
Advantage:
• Cancels out linear normalization scaling and simplifies the
analysis. However, the linearity assumption is often not true.
• Helps to cancel out gene-label interactions if they exist.
2.4.3. Choice of reference sample
Different choices of reference sample:
a) Normal patient or time 0 sample in time course
study
b) Pool all samples or all normal samples
c) Embryonic cells
d) Commercial kit
Ideally we want all genes expressed at a constant
moderate level in reference sample.
2.4.4. Design issue
From Yang, UCSF
2.4.4. Design issue
Design issues:
(a) Reference design: the reference sample is redundantly measured
many times.
(b) Loop design
(c) Balance design
(v samples with v+2 experiments; v samples with 2v experiments)
See Kerr et al. 2001
Conclusion of cDNA array
1. The Affymetrix GeneChip is preferred when available.
2. Unlike GeneChip data, cDNA array data are usually noisier, and
careful quality control (replicates and calibration) is important.
But occasionally custom arrays are needed for specific research.
3. Analysis of cDNA microarrays also applies to other two-color
technologies such as array CGH and similar two-color oligo arrays.
4. The conservative "reference design" is usually more robust,
although it is not statistically the most efficient.
3. Data preprocessing
3.1. Data Truncation and Transformation
Transformation
1. Logarithmic transformation (most commonly used)
-- tend to get an approximately normal distribution
-- should avoid negative or 0 intensity before transformation
2. Square root transformation
-- a variance-stabilizing transformation under Poisson model.
3. Box-Cox transformation family
4. Affine transformation
5. Generalized-log transformation
For details, see chapter 6.1 in Lee's book; log10 or log2
transformation is the most common practice.
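Two of the transformations above can be sketched as follows. The flooring constant and the glog tuning constant `c` are arbitrary illustrative choices (in practice `c` is estimated from the data).

```python
import numpy as np

def log2_transform(x, floor=1.0):
    """Log2 after flooring, avoiding negative or zero intensities."""
    return np.log2(np.maximum(x, floor))

def glog(x, c=100.0):
    """Generalized log: behaves like log2 for large x, but remains
    defined (and roughly linear) near 0."""
    return np.log2((x + np.sqrt(x ** 2 + c ** 2)) / 2.0)

x = np.array([0.0, 50.0, 200.0, 6400.0])
y_log = log2_transform(x)
y_glog = glog(x)
# For large intensities glog and plain log2 agree closely.
```
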
3.2. Filtering
Filtering is an important step in microarray analysis:
1. Without filtering, many genes are irrelevant to the biological
investigation and add noise to the analysis. (Among the ~30,000 genes
in the human genome, usually only around 6,000 are expressed and vary
in a given experiment.)
2. But filtering out too many genes runs the risk of eliminating
important biomarkers.
3. Three common targets of filtering:
  1. Genes of bad experimental quality.
  2. Genes that are not expressed.
  3. Genes that do not fluctuate across experiments.
3.2. Filtering
Filter out genes with bad quality in cDNA array: Outputs
from imaging analysis usually have a quality index or flag
to identify genes with bad quality image.
Three common sources of bad quality probes:
1. Problematic probes: probes with non-uniform intensities.
2. Low-intensity probes: genes with low intensities are
known to have bad reproducibility and hard to verify by
RT-PCR. Normally genes with intensities less than 100 or
200 are filtered.
3. Saturated probes: genes with intensities reaching
scanner limit (saturation) should also be filtered.
For Affymetrix and other platforms, each probe (set) also has a
detection p-value, quality flag or present/absent call.
3.2. Filtering
Filtering by quality index:
Different array platforms and image analysis software use different
formats.
[Figure: example quality-index output flagging low-intensity spots]
3.2. Filtering
Filtering by quality index:
             Array 1    Array 2    ...    Array S
  Gene 1     342.061    267.247    ...    NA
  Gene 2     72.2798    54.2583    ...    49.4225
  Gene 3     69.6987    73.8338    ...    58.7938
  Gene 4     163.73     197.419    ...    196.236
  Gene 5     136.412    140.536    ...    146.344
  Gene 6     131.405    96.128     ...    93.5549
  :          :          :                 :
  Gene G-2   763.445    936.445    ...    768.63
  Gene G-1   NA         34.7477    ...    30.3535
  Gene G     12.5406    13.648     ...    15.9003

NA: not available. Missing values are due to bad quality, low
intensity or saturation.
3.2. Filtering
Filter genes with low information content:
1. Small standard deviation (stdev)
2. Small coefficient of variation (CV = stdev/mean)

[Figure: two genes measured over four samples, both with stdev = 6.45;
gene 1 (intensities around 115-130) has CV = 0.053, while gene 2
(intensities around 15-30) has CV = 0.29.]

Note: CV is more reasonable for original intensities, but for
log-transformed intensities stdev is enough. Why?
3.2. Filtering
Gene filtering
A simple gene filtering routine (I usually use) before down-stream
analyses:
1. Take the log (base 2) transformation.
2. Delete genes with more than 20% missing values among all samples.
3. Delete genes with average expression level less than, say, α = 7
(2^7 = 128). For Affymetrix and most other platforms, intensities less
than 100-200 are simply noise.
4. Delete genes with standard deviation smaller than, say, β = 0.4
(2^0.4 = 1.32, i.e. a 32% fold change).
5. Possibly adjust β so that the number of remaining genes is
computationally manageable in down-stream analysis (e.g. around
~5,000 genes).
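The routine above can be sketched directly in numpy; the toy matrix below is fabricated so that each failure mode appears once.

```python
import numpy as np

def filter_genes(X, max_missing=0.2, min_mean=7.0, min_sd=0.4):
    """Gene filtering on a genes x samples matrix of log2 intensities
    (NaN = missing).  Keeps genes with at most max_missing fraction of
    missing values, average expression >= min_mean (2^7 = 128 on the
    raw scale) and standard deviation >= min_sd."""
    miss_frac = np.isnan(X).mean(axis=1)
    mean = np.nanmean(X, axis=1)
    sd = np.nanstd(X, axis=1, ddof=1)
    return (miss_frac <= max_missing) & (mean >= min_mean) & (sd >= min_sd)

# Toy matrix: gene 0 passes; gene 1 is too dim; gene 2 is too flat;
# gene 3 has too many missing values.
X = np.log2(np.array([
    [4000.0, 1000.0, 2000.0, 8000.0],
    [40.0, 10.0, 20.0, 80.0],
    [4000.0, 4100.0, 3900.0, 4000.0],
    [4000.0, np.nan, np.nan, 8000.0],
]))
keep = filter_genes(X)
```
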
3.2. Filtering
Sample filtering (detecting problematic slides)
Compute correlation matrix of the samples:
1. Arrays of replicates should have high correlation. (m,n,o,p are
replicates and q,r,s,t are another set of replicates)
2. A problematic array is often found to have low correlation
with all the other arrays.
3. Heatmap is usually plotted for better visualization.
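The correlation check above can be sketched as follows; the 0.7 cutoff and the simulated data are illustrative assumptions, not values from the slides.

```python
import numpy as np

def flag_problematic(X, threshold=0.7):
    """Flag samples (columns of a genes x samples matrix) whose median
    correlation with the other samples falls below threshold; such an
    array is often problematic."""
    C = np.corrcoef(X, rowvar=False)       # sample-by-sample correlations
    np.fill_diagonal(C, np.nan)            # ignore self-correlations
    return np.nanmedian(C, axis=0) < threshold

# Simulated data: four replicate arrays sharing a common signal, plus
# one unrelated (problematic) array.
rng = np.random.default_rng(2)
signal = rng.normal(size=(500, 1))
X = np.hstack([signal + 0.1 * rng.normal(size=(500, 4)),
               rng.normal(size=(500, 1))])
flags = flag_problematic(X)
# Only the last, unrelated array is flagged.
```
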
3.2. Filtering
Diagnostic plot by correlation matrix

[Figure: heatmap of the sample correlation matrix for replicate groups m,n,o,p and q,r,s,t. White: high correlation; dark gray: low correlation.]
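The replicate check above can be simulated directly (the data and the mean-correlation flagging rule here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(8.0, 1.0, size=500)        # shared expression signal (500 genes)

# Four replicate arrays = signal + small noise, plus one problematic array
arrays = [signal + rng.normal(0.0, 0.1, size=500) for _ in range(4)]
arrays.append(rng.normal(8.0, 1.0, size=500))  # unrelated to the signal

corr = np.corrcoef(np.array(arrays))                    # sample x sample correlations
mean_corr = (corr.sum(axis=1) - 1) / (len(arrays) - 1)  # mean correlation with the others
suspect = int(mean_corr.argmin())
print(suspect)   # 4: the problematic array has low correlation with every other array
```

The replicates correlate near 1 with each other, while the problematic array's mean correlation is near 0, so it stands out immediately.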
3.3. Missing Value Imputation
Reasons for missing values in microarrays:
• spotting problems (cDNA)
• dust
• fingerprints
• poor hybridization
• inadequate resolution
• fabrication errors (e.g. scratches)
• image corruption

Many down-stream analyses require complete data, so "imputation" is usually helpful.
3.3. Missing Value Imputation

           Array 1    Array 2    ...    Array S
Gene 1     342.061    267.247    ...    NA
Gene 2     72.2798    54.2583    ...    49.4225
Gene 3     69.6987    73.8338    ...    58.7938
Gene 4     163.73     197.419    ...    196.236
Gene 5     136.412    140.536    ...    146.344
Gene 6     131.405    96.128     ...    93.5549
:          :          :          :      :
Gene G-2   763.445    936.445    ...    768.63
Gene G-1   NA         34.7477    ...    30.3535
Gene G     12.5406    13.648     ...    15.9003

It is common to have ~5% MVs in a study:
5,000 genes × 50 arrays × 5% = 12,500 missing values.
3.3. Missing Value Imputation
Existing methods
• Naïve approaches
  – Missing values = 0 or 1 (arbitrary signal)
  – Missing values = row (gene) average
• Smarter approaches have been proposed:
  – K-nearest neighbors (KNN)
  – Regression-based methods (OLS)
  – Singular value decomposition (SVD)
  – Local SVD (LSVD)
  – Partial least squares (PLS)
  – More: Bayesian principal component analysis (BPCA), least squares adaptive (LSA), local least squares (LLS)

Underlying assumption: genes work cooperatively in groups, so genes with similar patterns provide information for MV imputation.
3.3. Missing Value Imputation
KNN.e & KNN.c
• Choose the k genes that are most "similar" to the gene with the missing value (MV).
• Estimate the MV as the weighted mean of the neighbors.
• Considerations:
  – number of neighbors (k)
  – distance metric
  – normalization step

[Figure: expression profiles across arrays, with one randomly missing datum marked "?".]

3.3. Missing Value Imputation
KNN.e & KNN.c
• Parameter k
  – 10 usually works (5-15)
• Distance metric
  – euclidean distance (KNN.e)
  – correlation-based distance (KNN.c)
• Normalization?
  – not necessary for euclidean neighbors
  – required for correlation neighbors
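A minimal KNN.e sketch of the scheme above, using euclidean neighbors and inverse-distance weights (function name, simplifications, and toy matrix are mine, not from the slides; for simplicity, neighbor candidates must be observed on the relevant arrays):

```python
import numpy as np

def knn_impute(X, k=2):
    """KNN.e sketch for a genes x arrays matrix X (np.nan marks MVs):
    impute each MV as the inverse-distance-weighted mean of the k genes
    closest (euclidean distance over the observed arrays) to its gene."""
    X = X.copy()
    for g, a in np.argwhere(np.isnan(X)):
        obs = ~np.isnan(X[g])                   # arrays observed for gene g
        # candidates: genes observed on those arrays and at array a
        cand = [h for h in range(X.shape[0])
                if h != g and not np.isnan(X[h, a])
                and not np.isnan(X[h][obs]).any()]
        dist = np.array([np.linalg.norm(X[g, obs] - X[h, obs]) for h in cand])
        nn = np.argsort(dist)[:k]               # the k nearest neighbors
        w = 1.0 / (dist[nn] + 1e-9)             # inverse-distance weights
        X[g, a] = np.average([X[cand[i], a] for i in nn], weights=w)
    return X

# Gene 0 is missing its value on array 2; genes 1 and 2 are its neighbors
X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.1],
              [0.9, 1.9, 2.9],
              [9.0, 1.0, 5.0]])
print(knn_impute(X, k=2)[0, 2])   # ~3.0 (weighted mean of 3.1 and 2.9)
```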
3.3. Missing Value Imputation
OLS.e & OLS.c
• Regression-based approach: KNN + OLS.
• Algorithm:
  – Choose k neighbors (euclidean or correlation; normalized or not).
  – The gene with the MV is regressed on each neighbor gene, one at a time (simple regression).
  – For each neighbor, the MV is predicted from the regression model.
  – The MV is imputed as the weighted average of the k predictions.

[Figure: expression profiles across arrays with a randomly missing datum marked "?", and two neighbor regressions
y1 = a1 + b1·x1
y2 = a2 + b2·x2,
with the missing datum imputed as y = w1·y1 + w2·y2.]
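The regression step above can be sketched as follows (names, simplifications, and the toy matrix are mine; as in the KNN sketch, only completely observed genes serve as neighbor candidates):

```python
import numpy as np

def ols_impute_one(X, g, a, k=2):
    """OLS.e sketch: regress gene g on each of its k euclidean neighbors
    (one simple regression per neighbor, over the arrays where gene g is
    observed) and return the weighted average of the k predictions."""
    obs = ~np.isnan(X[g])                          # arrays observed for gene g
    cand = [h for h in range(X.shape[0])
            if h != g and not np.isnan(X[h]).any()]
    dist = np.array([np.linalg.norm(X[g, obs] - X[h, obs]) for h in cand])
    preds, weights = [], []
    for i in np.argsort(dist)[:k]:
        h = cand[i]
        b, a0 = np.polyfit(X[h, obs], X[g, obs], 1)  # fit y_g = a0 + b * y_h
        preds.append(a0 + b * X[h, a])               # predict the MV from neighbor h
        weights.append(1.0 / (dist[i] + 1e-9))       # inverse-distance weight
    return np.average(preds, weights=weights)

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.1],
              [0.9, 1.9, 2.9],
              [9.0, 1.0, 5.0]])
print(ols_impute_one(X, g=0, a=2))   # ~3.0
```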
3.3. Missing Value Imputation
SVD
• Algorithm
  – Set MVs to the row average (need a starting point).
  – Decompose the expression matrix into orthogonal components, "eigengenes".
  – Use the proportion, p, of eigengenes corresponding to the largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve the estimate).
  – Use an EM approach to iteratively improve the estimates of the MVs until convergence.
• Assumption:
  – The complete expression matrix can be well decomposed by a small number of principal components.
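The SVD/EM loop above can be sketched in a single function (a simplified version; the function name, convergence rule, and toy example are mine):

```python
import numpy as np

def svd_impute(X, n_eigen=2, n_iter=200):
    """SVD-imputation sketch: start from row averages, then repeatedly
    reconstruct the matrix from its top n_eigen eigengenes and overwrite
    only the missing entries (EM-style) until the MVs stop changing."""
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=1, keepdims=True), X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :n_eigen] * s[:n_eigen]) @ Vt[:n_eigen]  # rank-p reconstruction
        new = np.where(miss, low_rank, X)          # update only the MVs
        if np.allclose(new, filled, atol=1e-10):
            break
        filled = new
    return filled

# Toy rank-1 matrix with one entry removed; the true value (12.0) is recovered
base = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
X = base.copy()
X[2, 3] = np.nan
print(svd_impute(X, n_eigen=1)[2, 3])   # ~12.0
```

The toy example illustrates the assumption on the slide: because the data are well described by one eigengene, the rank-1 reconstruction pins down the missing entry.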
3.3. Missing Value Imputation
LSVD.e & LSVD.c
• KNN + SVD:
  – Choose k neighbors (euclidean or correlation; normalized or not).
  – Perform SVD on the k nearest neighbors and get a prediction of the missing value.
3.3. Missing Value Imputation
PLS
• PLS: select linear combinations of genes (PLS components) exhibiting high covariance with the gene having the MV.
  – The first linear combination of genes has the highest correlation with the target gene.
  – The second linear combination has the greatest correlation with the target gene in the space orthogonal to the first.
• MVs are then imputed by regressing the target gene onto the PLS components.
3.3. Missing Value Imputation
Types of missing mechanism:
1. Missing completely at random (MCAR)
   Missingness is independent of both the observed values and the unobserved values themselves.
   Examples: a spot missing due to mis-printing or a dust particle; a spot missing due to scratches.
2. Missing at random (MAR)
   Missingness is independent of the unobserved data but depends on the observed data.
3. Missing not at random (MNAR)
   Missingness depends on the unobserved data.
   Example: spots missing due to saturation or low expression.

Current imputation methods only work for MCAR, not MNAR.
Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.
Guy N. Brock, John R. Shaffer, Richard E. Blakesley, Meredith J. Lotz, George C. Tseng.
BMC Bioinformatics, 2008.
3.3. MV imputation comparative study

Data set                Full Dim.    Used Dim.    Category                          Organism       Expression Profiles
Alizadeh (ALI)          13412 x 40   5635 x 40    multiple exposure                 H. sapiens     diffuse large B-cell lymphoma
Alon (ALO)              2000 x 62    2000 x 62    multiple exposure                 H. sapiens     colon cancer and normal colon tissue
Baldwin (BAL)           16814 x 39   6838 x 39    time series, non-cyclic           H. sapiens     epithelial cellular response to L. monocytogenes
Causton (CAU)           4682 x 45    4616 x 45    multiple exposure x time series   S. cerevisiae  response to changes in extracellular environment
Gasch (GAS)             6152 x 174   2986 x 155   multiple exposure x time series   S. cerevisiae  cellular response to DNA-damaging agents
Golub (GOL)             7129 x 72    1994 x 72    multiple exposure                 H. sapiens     acute lymphoblastic leukemia
Ross (ROS)              9706 x 60    2266 x 60    multiple exposure                 H. sapiens     NCI60 cancer cell lines
Spellman, AFA (SP.AFA)  7681 x 18    4480 x 18    time series, cyclic               S. cerevisiae  cell-cycle genes
Spellman, ELU (SP.ELU)  7681 x 14    5766 x 14    time series, cyclic               S. cerevisiae  cell-cycle genes

9 data sets: multiple exposure, time series, or both.
7 methods were compared: KNN, OLS, LSA, LLS, PLS, SVD, BPCA.
3.3. MV imputation comparative study
Global-based methods (PLS, SVD, BPCA):
estimate the global structure of the data to impute MVs.
Neighbor-based methods (KNN, OLS, LSA, LLS):
borrow information from correlated genes (neighbors).

Intuitively, global-based methods require that dimension reduction of the data can be performed effectively. We define an entropy measure for a given data set D to determine how well the data can be dimension-reduced (the λ_i are the eigenvalues):

e(D) = -Σ_{i=1}^{k} p_i log(p_i) / log(k),  where p_i = λ_i / Σ_{l=1}^{k} λ_l

Low entropy: the first few eigenvalues dominate and the data can be effectively reduced to low dimension.
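The entropy measure can be computed directly from a singular value decomposition. A sketch, taking λ_i as the squared singular values of the row-centered data matrix (one reasonable reading of the slide; the function name and the simulated low/high-entropy data are mine):

```python
import numpy as np

def entropy(X):
    """Entropy e(D) from the slide: p_i = lambda_i / sum(lambda),
    e(D) = -sum(p_i log p_i) / log(k), with lambda_i taken as the
    squared singular values of the row-centered data matrix."""
    s = np.linalg.svd(X - X.mean(axis=1, keepdims=True), compute_uv=False)
    lam = s ** 2
    k = len(lam)
    p = lam / lam.sum()
    p = p[p > 0]                                   # 0 * log 0 = 0 by convention
    return float(-(p * np.log(p)).sum() / np.log(k))

rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=200), rng.normal(size=20))  # near rank-1 data
low = low + 0.01 * rng.normal(size=low.shape)
high = rng.normal(size=(200, 20))                          # full-rank noise
print(entropy(low), entropy(high))   # low entropy << high entropy
```

The near rank-1 matrix has one dominant eigenvalue and entropy near 0; the full-rank noise matrix spreads its eigenvalues and has entropy near 1.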
3.3. MV imputation comparative study
LRMSE is the performance measure; the lower the better.

KNN, OLS, LSA, and LLS are neighbor-based methods and work better in low-entropy data sets. PLS and SVD are global-based methods and work better in high-entropy data sets.

LRMSE(D̂_j^{(M_i,k)}, D_j) = β_{0i} + β_i · e(D_j) + γ_j + ε_{ijk}
3.3. MV imputation comparative study

Data set   Entropy   Optimal method (number of simulations, out of 50)
BAL        0.819     LSA (38), LLS (12)
CAU        0.838     LLS (45), LSA (5)
ALO        0.872     LSA (50)
GOL        0.876     LSA (50)
SP.ELU     0.909     LLS (41), BPCA (9)
GAS        0.911     LSA (50)
SP.AFA     0.940     LLS (40), BPCA (10)
ROS        0.944     LSA (50)
ALI        0.947     LSA (50)

[Table continues: for Simulations II and III, the methods selected by the EBS and STS schemes (mostly LSA, LLS, or BPCA) and their per-data-set selection accuracies, ranging from 0% to 100%.]

Three methods (LSA, LLS, BPCA) performed best, but none dominated. Two selection schemes (the entropy-based scheme, EBS, and the self-training scheme, STS) were proposed to select the best imputation method.