Statistical analysis of
expression data:
Normalization, differential
expression and multiple testing
Jelle Goeman
Outline
Normalization
Expression variation
Modeling the log Fold change
Complex designs
Shrinkage and empirical Bayes (limma)
Multiple testing (False Discovery Rate)
Measuring expression
Platforms
Microarrays
RNAseq
Common:
Need for normalization
Batch effects
Why normalization
Some experimental factors cannot be
completely controlled
Amount of material
Amount of degradation
Print tip differences
Quality of hybridization
Effects are systematic
Cause variation between samples and
between batches
What is normalization?
 Normalization =
An attempt to get rid of unwanted systematic
variation by statistical means
 Note 1: this will never completely succeed
 Note 2: this may do more harm than good
 Much better, but often impossible:
better control of the experimental conditions
How do normalization methods work?
 General approach
1. Assume: data from an ideal experiment would
have characteristic A
E.g. mean expression is equal for each sample
Note: this is an assumption!
2. If the data do not have characteristic A, change
the data such that the data now do have
characteristic A
E.g. Multiply each sample’s expression by a factor
Example: quantile normalization
 Assume:
“Most probes are not differentially expressed”
“As many probes are upregulated as downregulated”
 Reasonable consequence:
The distribution of the expression values is identical for
each sample
 Normalization:
Make the distribution of expression values identical for
each sample
Quantile normalization in practice
 Choose a target distribution
 Typically the average of the measured distributions
 All samples will get this distribution after normalization
 Quantile normalization:
 Replace the ith largest expression value in each sample by the
ith largest value in the target distribution
 Consequence:
 Distribution of expressions the same between samples
 Expressions for specific genes may differ
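The procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular package: the function name `quantile_normalize` and the toy data are assumptions, all samples are assumed to have the same number of probes, and ties are not treated specially.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (each a list of expression values).

    Assumes every sample has the same number of probes; ties are not averaged.
    """
    n_probes = len(samples[0])
    # Target distribution: average of the sorted values across samples.
    sorted_samples = [sorted(s) for s in samples]
    target = [sum(s[i] for s in sorted_samples) / len(samples)
              for i in range(n_probes)]
    normalized = []
    for s in samples:
        # Probe indices ordered from smallest to largest value in this sample.
        order = sorted(range(n_probes), key=lambda i: s[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # The probe with a given rank receives the target value of that rank.
            out[probe] = target[rank]
        normalized.append(out)
    return normalized


# Hypothetical expression values: 3 samples, 4 probes each
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After this step every sample contains exactly the same set of values, but a given probe can still take different values in different samples, because its rank differs from sample to sample.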
Less radical forms of normalization
Make the means per sample the same
Make the medians the same
Make the variances the same
Loess curve smoothing
Same idea, but less change to the data
Overnormalizing
 Normalizing can remove or reduce true
biological differences
Example: global increase in expression
 Normalization can create differences that are not
there
Example: almost global increase in expression
 Usually: normalization reduces unwanted
variation
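The two overnormalization examples can be made concrete with a toy calculation. The sketch below uses the simplest method from the previous slide (making the means per sample the same) on invented log-scale values; the function names and numbers are illustrative assumptions only.

```python
def equalize_means(samples):
    """Shift each sample (log scale) so that all sample means are equal."""
    means = [sum(s) / len(s) for s in samples]
    target = sum(means) / len(means)
    return [[x - m + target for x in s] for s, m in zip(samples, means)]


def differences(a, b):
    """Per-gene difference b - a between two samples."""
    return [y - x for x, y in zip(a, b)]


control = [5.0, 6.0, 7.0, 8.0]
global_up = [6.0, 7.0, 8.0, 9.0]          # every gene truly up by 1
almost_global_up = [6.0, 7.0, 8.0, 8.0]   # last gene truly unchanged

c1, t1 = equalize_means([control, global_up])
c2, t2 = equalize_means([control, almost_global_up])

diff_global = differences(c1, t1)   # the true global increase is removed
diff_almost = differences(c2, t2)   # the unchanged gene now looks downregulated
```

In the first case all differences become 0, so the true biological effect is gone. In the second case the three truly increased genes show only +0.25, and the unchanged gene shows -0.75, a difference that is not there.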
Batch effects
 Differences between batches are even stronger
than between samples in the same batch
 Note: batch effects at several stages
 Normalization is not sufficient to remove batch effects
 Methods available (ComBat) but not perfect
 Best: avoid batch effects if possible
Confounding by batch
Take care of batch-effects in experimental
design
Problem: confounding of effect of interest
by batch effects
Example: Golub data
Solution: balance or randomize
Expression variation
Differential expression
 Two experimental conditions
Treated versus untreated
 Two distinct phenotypes
Tumor versus normal tissue
 Which genes can reliably be called differentially
expressed?
 Also: continuous phenotypes
Which gene expressions are correlated with phenotype?
Variation in gene expression
 Technical variation
Variation due to measurement technique
Variability of measured expression from experiment to
experiment on the same subject
 Biological variation
Variation between subjects/samples
Variability of “true” expression between different
subjects
 Total variation
Sum of technical and biological variation
Reliable assessment
 Two samples always have different expression
 Maybe even a high fold change
 Due to random biological and technical variation
 Reliable assessment of differential expression:
 Show: fold change found cannot be explained by
random variation
Assessment of differential expression
Two interrelated aspects:
Fold change:
How large is the expression difference found?
P-value:
How sure are we that a true difference exists?
LIMMA:
Linear models for gene expression
Modeling variation
 How does gene expression depend on
experimental conditions?
 Can often be well modeled with linear models
 Limma:
linear models for microarray data
Gordon Smyth, Walter and Eliza Hall Institute, Australia
Multiplicative scale effects
 Assumption: effects on gene expression work in
a multiplicative way (“fold change”)
 Example: treatment increases gene expression
of gene MMP8 by a factor 2
“2-fold increase”
 Treatment decreases gene expression of gene
MMP8 by a factor 2
“2-fold decrease”
Multiplicative scale errors
Assumption: variation on gene expression
works in a multiplicative way
A 2-fold increase by chance is just as
likely as a 2-fold decrease by chance
When true expression is 4, measuring 8 is
as likely as measuring 2
Working on the log scale
 When effects are multiplicative, log-transform!
 Usual in microarray analysis: log to base 2
 Remember: log(ab) = log(a)+log(b)
2-fold increase = +1 to log expression
2-fold decrease = -1 to log expression
 Log scale makes multiplicative effects symmetric
½ and 2 are not symmetric around 1 (= no change)
-1 and +1 are symmetric around 0 (= no change)
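A short numerical check of the symmetry argument, using the values from the earlier slide (true expression 4, measured 8 or 2). The variable names are illustrative.

```python
import math

untreated = 4.0
doubled = 8.0   # 2-fold increase
halved = 2.0    # 2-fold decrease

# On the original scale the two changes are not symmetric.
diff_up = doubled - untreated      # +4
diff_down = halved - untreated     # -2

# On the log2 scale they are symmetric around 0.
log_up = math.log2(doubled) - math.log2(untreated)     # +1
log_down = math.log2(halved) - math.log2(untreated)    # -1
```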
A simple linear model
 Example: treated and untreated samples
 Model separately for each gene
 Log Expression of gene 1: E1
 E1 = a + b * Treatment + error
 a: intercept = average untreated log expression
 b: slope = treatment effect
Modeling all genes simultaneously
 E1 = a1 + b1 * Treatment + error
 E2 = a2 + b2 * Treatment + error
…
 E20,000 = a20,000 + b20,000 * Treatment +
error
 Same model, but
 Separate intercept and slope for each gene
 And a separate error standard deviation sigma1, sigma2, … for each gene
Estimates and standard errors
 Gene 1: Estimates for a1, b1 and sigma1
 Estimate of treatment effect of gene 1
 b1 is the estimated log fold change
 standard error s.e.(b1) depends on sigma1
 Regular t-test for H0: b1=0:
 T = b1/s.e.(b1)
 Can be used to calculate p-values.
 Just like regular regression, only 20,000 times
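For a two-group design the per-gene regression reduces to group means and a pooled residual variance. The sketch below shows this for two genes with invented log2 expression values; the function name `fit_gene` and the data are assumptions. Turning T into a p-value requires the t distribution with the residual degrees of freedom and is left out to keep the example free of external libraries.

```python
import math


def fit_gene(untreated, treated):
    """Fit E = a + b * Treatment + error for one gene (log2 expression)."""
    n0, n1 = len(untreated), len(treated)
    a = sum(untreated) / n0          # intercept: mean untreated log expression
    b = sum(treated) / n1 - a        # slope: estimated log fold change
    # Residual sum of squares around the two fitted group means.
    rss = sum((y - a) ** 2 for y in untreated)
    rss += sum((y - (a + b)) ** 2 for y in treated)
    df = n0 + n1 - 2                 # residual degrees of freedom
    sigma = math.sqrt(rss / df)      # estimated error standard deviation
    se_b = sigma * math.sqrt(1 / n0 + 1 / n1)
    return {"a": a, "b": b, "sigma": sigma, "se_b": se_b,
            "t": b / se_b, "df": df}


# Hypothetical data: (untreated, treated) log2 expression per gene
genes = {
    "gene1": ([7.0, 7.5, 6.5], [8.0, 8.5, 7.5]),
    "gene2": ([5.0, 5.5, 4.5], [5.0, 4.5, 5.5]),
}
results = {name: fit_gene(u, t) for name, (u, t) in genes.items()}
```

Here gene1 gets an estimated log fold change of 1 (a 2-fold increase) and gene2 an estimate of 0. The same function is simply applied once per gene.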
Back to original scale
Log scale regression coefficient b1
Average log fold change
Back to a fold change: 2^b1
b1= 1 becomes fold change 2
b1 = -1 becomes fold change 1/2
Confounders
Other effects may influence gene
expression
Example: batch effects
Example: sex or age of patients
In a linear model we can adjust for such
confounders
Flexibility of the linear model
Earlier: E1 = a1 + b1 * Treatment + error
Generalize:
E1 = a1 + b1 * X + c1 * Y + d1 * Z + error
Add as many variables as you need.
Variance shrinkage
Empirical Bayes
So far: each gene on its own
20,000 unrelated models
Limma: exchange information between
genes
“Borrowing strength”
By empirical Bayes arguments
Estimating variance
 For each gene a variance is estimated
 Small sample size: variance estimate is
unreliable
Too small for some genes
Too large for others
 Variance estimated too small: false positives
 Variance estimated too large: low power
Large and small estimated variance
 Gene with low variance estimate
Likely to have low true variance
But also: likely to have underestimated variance
 Gene with high variance estimate
Likely to have high true variance
But also: likely to have overestimated variance
 Limma’s idea:
Use information from other genes to assess whether
variance is over/underestimated
True and estimated variance
Variance model
Limma has a gene variance model
All genes’ variances are drawn at random
from an inverse gamma distribution
Based on this model:
Large variances are shrunk downwards
Small variances are shrunk upwards
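Under this model the shrunken variance is a weighted average of the gene's own variance estimate and a prior variance, with the degrees of freedom as weights. The sketch below applies that formula with fixed, invented prior values; limma itself estimates the prior variance and prior degrees of freedom from all genes, which is not shown here. The function name `shrink_variance` is illustrative.

```python
def shrink_variance(s2, df, prior_s2, prior_df):
    """Weighted average of a gene's own variance and a prior variance."""
    return (prior_df * prior_s2 + df * s2) / (prior_df + df)


# Hypothetical prior; in limma these two numbers are estimated from the data.
prior_s2, prior_df = 0.25, 4.0
df = 4  # residual degrees of freedom per gene in a small experiment

gene_variances = {"low": 0.01, "typical": 0.25, "high": 1.00}
shrunk = {g: shrink_variance(s2, df, prior_s2, prior_df)
          for g, s2 in gene_variances.items()}

# With many residual degrees of freedom the gene's own estimate dominates.
shrunk_large_n = shrink_variance(0.01, 100, prior_s2, prior_df)
```

With these numbers the low variance 0.01 is pulled up to 0.13 and the high variance 1.00 is pulled down to 0.625, while with 100 residual degrees of freedom the low variance moves only to about 0.019.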
Effect of variance shrinkage
Genes with large fold change and large
variance
More power
More likely to be significant
Genes with small fold change and small
variance
Less power
Less likely to be significant
Limma and sample size
Shrinkage of limma only effective for small
sample size (< 10 samples/group)
Added information of other genes
becomes negligible if sample size gets
large
Large samples: Doing limma is the same
as doing regression per gene
Differential expression in RNAseq
RNAseq data: counts
Gene id           Y1   Y2   Y3   Y4    Y5   Y6    Y7    Y8   Y9  Y10
ENSG00000110514   69  178  101   58   101   31   165   108   70    1
ENSG00000086015  115   52   86   88   146   84    59    85   86    0
ENSG00000115808  285  190  467  295   345  532   369   473  423    5
ENSG00000169740  502  184  363  195   403  262   225   332  136    3
ENSG00000215869    0    7    0    0     0    0     0     2    0    0
ENSG00000261609   20   31   76   20    25  158    23    18   23    1
ENSG00000169744  488  529  470  505  1137  373  1392  3517  192    1
ENSG00000215864    1    0    0    0     0    0     0     0    0    0
Modelling count data
Distinguish three types of variation
Biological variation
Technical variation
Count variation
Count variation is important for low-expressed genes
Generally biological variation most
important
Overdispersion
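Why count variation matters mainly for low-expressed genes can be seen from a small calculation. The sketch assumes Poisson counting noise on top of biological variation, which gives a negative binomial style variance of mean + (biological CV × mean)². That specific variance form is an assumption added here for illustration and is not stated on the slide; the biological CV of 0.2 is invented.

```python
import math


def total_cv(mean_count, biological_cv):
    """Coefficient of variation of a count: counting noise plus biological variation."""
    # Assumed variance: Poisson part (mean) + overdispersion part.
    variance = mean_count + (biological_cv * mean_count) ** 2
    return math.sqrt(variance) / mean_count


bcv = 0.2  # hypothetical biological coefficient of variation

cv_low = total_cv(5, bcv)       # low-expressed gene: counting noise dominates
cv_high = total_cv(5000, bcv)   # high-expressed gene: biological variation dominates
```

For the gene with mean count 5 the total CV is about 0.49, far above the biological CV of 0.2. For the gene with mean count 5000 the total CV is almost exactly the biological CV.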
Modelling count data: two stages
1. Model how gene expression varies from
sample to sample
2. Model how the observed count varies by
repeated sequencing of the same sample
Stage 2 is specific for RNAseq
Two approaches
 Approach 1: Model the count variation and the
between-sample variation
edgeR
DESeq
 Approach 2: Normalize the count data and
model only the biological variation
Voom + limma
 Approach 3: Model count variation only
Popular but very wrong!
Multiple testing
20,000 p-values
Fitting 20,000 linear models
Some variance shrinkage
Result:
20,000 fold changes
20,000 p-values
Which ones are truly differentially
expressed?
Multiple testing
Doing 20,000 tests: risk false positive
20,000 times
If 5% of null hypotheses are significant,
expect 1,000 significant by pure chance
How to make sure you can really trust the
results?
Bonferroni
Classical way of doing multiple testing
Call K the number of tests performed
Bonferroni: significant = p-value < 0.05/K
“Adjusted p-value”
Multiply all p-values by K, compare with 0.05
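A minimal sketch of the Bonferroni adjustment, applied to five example p-values. Capping adjusted p-values at 1 is a common convention added here and is not mentioned on the slide; the function name is illustrative.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests K, capped at 1."""
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]


p_values = [0.0031, 0.0034, 0.02, 0.10, 0.65]
adjusted = bonferroni_adjust(p_values)

# Compare the adjusted p-values with 0.05.
significant = [p < 0.05 for p in adjusted]
```

With K = 5 only the two smallest p-values remain significant. With K = 20,000 a raw p-value would have to be below 0.0000025.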
Advantages of Bonferroni
Familywise error control
= Probability of making any type I error < 0.05
With 95% chance, list of differentially
expressed genes has no errors
Very strict
Easy to do
Disadvantages of Bonferroni
Very strict
“No” false positives
Many false negatives
It is not a big problem to have a few false
positives
Do validation experiments later
False discovery rate
(Benjamini and Hochberg)
FDR = expected proportion of false
discoveries among all discoveries
Control of FDR at 0.05 means in the long
run experiments average about 5% type I
errors among the reported genes
Percentage: longer lists of genes are
allowed to have more errors
Benjamini and Hochberg by hand
1. Order the p-values small to large
Example: 0.0031, 0.0034, 0.02, 0.10, 0.65
2. Multiply the k-th p-value by m/k, where m is the number
of p-values, so
0.0031 * 5/1, 0.0034 * 5/2, 0.02 * 5/3, 0.10 * 5/4, 0.65 * 5/5
which becomes
0.0155, 0.0085, 0.033, 0.125, 0.65
3. If the adjusted p-values are no longer in increasing order, replace
each p-value that is larger than a later one by the smallest p-value
that comes later in the list. In the example, we replace 0.0155 by 0.0085. The
final Benjamini-Hochberg adjusted p-values become
0.0085, 0.0085, 0.033, 0.125, 0.65
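The three steps can be written as a short function. This is a sketch of the hand calculation above, not the code of any particular package; the name `bh_adjust` is an assumption. It walks from the largest p-value downwards so that step 3 becomes a running minimum.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Step 1: order the p-values from small to large.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each value is capped by later ones.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        # Step 2: multiply the k-th p-value by m/k.
        # Step 3: never exceed any adjusted value later in the list.
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = bh_adjust([0.0031, 0.0034, 0.02, 0.10, 0.65])
rounded = [round(p, 4) for p in adjusted]
```

For the example this reproduces the values above: 0.0085, 0.0085, 0.0333, 0.125, 0.65.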
FDR warnings
FDR is susceptible to cheating
How to cheat with FDR?
Add many tests of known false null
hypotheses…
Result: reject more of the other null
hypotheses
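The effect can be demonstrated with a toy example. The sketch applies the Benjamini-Hochberg rule at level 0.05 to four invented p-values, once on their own and once padded with ten tests of null hypotheses known to be false (p-values near zero). All names and numbers are illustrative assumptions.

```python
def bh_rejections(p_values, alpha=0.05):
    """Return the indices rejected by the Benjamini-Hochberg rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= alpha * k / m; reject ranks 1..k.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


honest = [0.01, 0.04, 0.30, 0.50]   # the hypotheses of real interest
padding = [1e-10] * 10              # tests of known false null hypotheses

rejected_honest = bh_rejections(honest)
# Keep only the indices that belong to the hypotheses of real interest.
rejected_padded = [i for i in bh_rejections(honest + padding)
                   if i < len(honest)]
```

Without padding only the first hypothesis is rejected. With padding the second hypothesis (p = 0.04) is rejected as well, although nothing about its own evidence has changed.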
Example limma results
Conclusion
Testing for differentially expressed genes
 Repeated application of a linear model
 Include all factors in the model that may
influence gene expression
 Limma: additional step “borrowing strength”
 Don’t forget to correct for multiple testing!