Microarray Design and analysis: Linear model analyses Edinburgh 13-15 Oct 2008

Microarray Design and analysis: Linear model analyses
Edinburgh 13-15 Oct 2008
Luc Janss
University of Aarhus
luc.janss@djf.au.dk
Course overview & general
remarks
Overview:
 1. Normalisations (LOWESS, variance, a bit on background). Practicals: start with R
 2. Simple analyses (t-tests, clustering), false
positives and the use of False Discovery Rate
 3. More advanced analyses using linear models
Course overview & general
remarks
Assumed background:
 (virtually) none in statistics
 (virtually) none in the use of R
 Some general biological background on expressions,
arrays, array experiments.
Course overview & general
remarks
Types of microarrays
 One sample per array (Affymetrix type)
 Two samples per array (e.g. spotted cDNA type).
Here: focus on the two-sample arrays
 Generally more noisy, more
complications in design and analysis
 Most is also applicable (sometimes
simpler …) to one-sample arrays.
1st Morning: Normalisations
 Logging, means and medians
 MA-plot and dye-intensity bias
 Correction using lowess fit
 Choice of correction basis
 Background correction
 Effect on variation at low intensities
 Handling of negative values
 Few remarks on Quality Control
 Introduction to the statistical package “R”
The raw data, use of
means/medians and
logging
Raw data
Spot localisation: size may vary; quality of
localisation may vary
Count of the number of pixels in the spot area and their colour (intensity) values
Total, mean or median of colour-values of
pixels
Use of the median intensity is safest because it
is not affected by a single extreme pixel
Similar for a number of background spots
And all for green (cy3) and red (cy5)
Gene expression from two channels:
plotted cy3 and cy5 raw data
Expectation: a cloud around the 45º line because
most genes are neutral
Problem: although the scale goes up to 15,000, the majority of points (here 92%) is below 1,500.
Any trends and relationships here say more about the last 8% of the data than about the first 92%…
[Scatter plot of raw cy5 vs cy3 intensities, both axes 0 to 15,000; 92% of the points sit in the lower-left corner]
Gene expressions from two channels after logging
[Scatter plot of the logged green and red intensities, both axes roughly 1 to 4.5; points above the diagonal are "more green", points below are "more red", the diagonal marks equal expression]
Logging, mean and medians
 So generally, we work on log scale
 Extreme values pose fewer problems
 Scale is better spread
 On log scale we can fairly safely use means and
linear models
 For items obtained before logging (e.g., intensities,
background corrections), medians are safer to use
 The median is the middle point of a data range, and is not
affected by an extreme value in a tail
The M-A plot and lowess
normalisation
More convenient plot: the MA-plot
[MA plot: A = average of the log intensities on the x-axis (about 1 to 4.5), M = difference of the log intensities on the y-axis (about -2 to 1.5). M is the log(fold change); "more green" points lie above 0, "more red" points below 0]
Logging and M-values
 The difference of logs = log(ratio) of original values
 So M can be seen to be the log(fold change):
 M<0: ratio <1 or down
 M>0: ratio >1 or up
 Often log base 2 is used, then:
 Every “M” unit + (or -) represents doubling (or halving) of
expression ratio
Computations using M and A
Let C1 = Intensity in Channel 1 (usually cy3)
Let C2 = Intensity in Channel 2 (usually cy5)
Log to base 2 is:
LC1 = log(C1)/log(2) in R also log2( )
LC2 = log(C2)/log(2)
A=(LC1+LC2)/2
M=LC2-LC1 => corresponds to log2(cy5/cy3)
Back transformation from M,A -> LC1,LC2
LC1 = A – 0.5M
LC2 = A + 0.5M
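
A minimal R sketch of these computations (the intensity vectors C1 and C2 below are made-up example numbers, not course data):

  C1 <- c(120, 560, 80, 1500)     # channel 1 (cy3) intensities, hypothetical values
  C2 <- c(150, 300, 90, 2100)     # channel 2 (cy5) intensities, hypothetical values
  LC1 <- log2(C1)                 # same as log(C1)/log(2)
  LC2 <- log2(C2)
  A <- (LC1 + LC2) / 2            # average of the log intensities
  M <- LC2 - LC1                  # log2(cy5/cy3), the log fold change
  # back transformation from M, A to the log intensities
  stopifnot(all.equal(LC1, A - 0.5 * M), all.equal(LC2, A + 0.5 * M))
  plot(A, M)                      # the MA-plot discussed on the next slides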
Some observations on the MA plot:
 The average M is not 0 but <0 (ratio < 1)
 Could it really be that more genes are down?
 Or could the slide simply have too much red?
 (A dye swap or controls could show)
 As the average intensity increases spots become
more red
 Although it could happen to be so, more likely this is an
artefact that needs correction
 The spread of M values increases at lower intensities
So are these the correct top spots? Or better these?
[Two MA plots with "Up" and "Dw" (down) selection limits drawn; the two plots select different top spots]
LOWESS (or LOESS) correction:
pulls the MA plot “straight”
[MA plot after LOWESS correction: the point cloud is pulled straight, centred on M = 0 across the intensity range]
And it also works for curved plots…
LO(W)ESS: background
 Takes a portion of your data
 Portion needs to be given by
user (e.g. 0.3 to 0.5)
 Data needs sorting or results
are sorted!
 Fits straight line
 Moves up 1 data point,
repeats
Windows using portion=0.50
Effect of LOWESS portion:
Using low portion parameter
gives little smoothing
Using high portion parameter
gives large smoothing
For M-A plot 0.3-0.5 usually gives good results: follows
main trend but does not fit every aberration
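
As an illustration, a minimal R sketch of a lowess-based correction of the MA plot, assuming vectors A and M as computed before; the "portion" is the f argument of R's lowess(), and 0.4 is an arbitrary choice within the recommended range:

  fit <- lowess(A, M, f = 0.4)                 # fit the trend of M on A
  trend <- approx(fit$x, fit$y, xout = A)$y    # evaluate the fitted trend at each spot's A value
  M.norm <- M - trend                          # corrected M values: trend pulled to M = 0
  plot(A, M.norm); abline(h = 0)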
Further issues for LOWESS fit
 LOWESS can be applied to various sets of genes
 All “real” genes on the array (not controls etc.): assumes that
there is no excess of either up or down genes (on average
“neutral”)
 Controls
 A subset of the “real” genes which we believe to be neutral
 Of course this can crucially affect the genes selected
in our top lists
Scenarios when adjusting lowess to controls or subsets
[Figure: raw data vs. lowess-adjusted data, with the genes, the controls and the limits for selecting top genes marked for each scenario]
Adjusting to controls or
subsets ….
We can also do some statistics to see whether the level of the controls differs from that of the genes
 A problem of using controls or subsets is that extrapolation outside the range of the controls is dangerous: it creates loss of data.
[Figure: the lowess fit can only be made over the data range covered by the controls; data outside that range, on either side, is lost]
A bit on background
correction
Background and Foreground
[Figure: a spot with foreground intensities 500 (red) and 300 (green) above a background hybridisation level of 100]
What is a better assessment of ratio between red and green?
• 500/300 = 1.67
• (500-100)/(300-100) = 2.00
Problems with background correction
Red    Green   Background   Red/Green after correction
500    300        0           1.67
500    300      100           2.00
500    300      200           3.00
500    300      250           5.00
500    300      299          >200
500    300      310          "-19"
Problems with background correction
 If the background gets close to one (or both) of the foreground signals, large ratios can arise
 Can increase spread of M values in lower intensity region
dramatically
 If background gets above one of the foreground signals
a negative ratio is the result
 These cannot be logged
 And actually have no meaning
 With background correction M’s get inaccurate when
one foreground is close to the background
Solutions to work with background
correction
 Points which get negative after correction:
 Need to be dropped -> loss of data
 Arbitrarily set at 1 -> gives arbitrary ratio
 Arbitrarily reduce entire background level -> also results in
arbitrary ratio values
 The safest option is to drop negative points, but this remains undesirable (good points get lost as well, see the following slide).
 Background computed on blank/empty spots is often
lower and gives less loss of data
Illustration
[Figure: example spot with foreground intensities of about 500 and 100 and a background around 300, so that one foreground falls below the background]
This spot would be lost after
background correction, but surely
looks interesting!
With arbitrary setting for green at
background+1, we would compute
a ratio of ~400.
Without background correction
we would compute a ratio of ~5.
Background correction?
 Many people now advocate not correcting for
background
 It introduces large variation/extreme ratios
 Handling of signals close to background remains unsatisfactory
 The safe option would be to drop points below background, but
this also drops interesting up/down genes
 Background corrections are applied on the raw (nonlogged) intensities, which introduces further
problems of extreme values.
A bit on Quality Control
Quality control
 Quality control can be done on several levels:
 By the scanner/image analysis software
 Using statistics within slide
 Using statistics between slides
Quality control from the
image analysis
Can report such features as:
 Standard deviation of pixel intensities
 Large s.d. can indicate non-round spots, bad spot localisation,
“dust” particles, etc.
 # of pixels in the spot: very small # may actually be
a dust particle in an otherwise faint spot
Within slide analysis for QC
 Having duplicate (or more) measurements per gene helps to filter out bad spots
 Analysis of sector effects (usually each sector is
associated with a print tip)
 Analysis of control and empty/blank spots
Between slide QC
 Slides with large variation or extreme values
 Slides with poor within-slide repeatability for
duplicate gene expressions
 Check for repeatability of genes between slides,
filter out poor duplicates on gene level
 Slides with dye effects (bad swap) (see the Wolfinger model tomorrow)
Second half day
Outline afternoon
 Problem of variance heterogeneity
 Look at multiple slides
 boxplots, variance normalisation
 First simple statistic for analysis: t-test
 Multiple testing problems
 False negatives and positives, False Discovery Rate
 Permutation approach (SAM)
 Clustering
Variance heterogeneity
 Variance heterogeneity can lead to unsatisfactory
choices for top-genes
 Simple “cut-off” favours high-variable genes
 Statistics (like t-test) favour low-variable genes
 Variance heterogeneity can be a problem
 between slides, between genes, between sectors within a slide,
….
 Note: variance estimates based on small numbers (such
as for genes) are very inaccurate
-> extra complication addressed Wednesday
Illustration variance
heterogeneity
[MA plot: a low-variance gene and a high-variance gene with the same average up-regulation]
• Simple cut-off would pick up high variable gene in some replicates, and
never low variable gene
• Statistical analysis would sooner pick up the low variable gene because
statistics say it has higher chance to really deviate from M=0
1. Boxplots and
variance normalisation
between slides
Variances between slides: boxplots
[Boxplot anatomy: the box runs from Q1 (25% of data) to Q3 (75% of data) with the median (Q2) inside; the box height is the interquartile range, and the whiskers extend 1.5 times the IQ range]
The 1.5 * IQ range is expected to contain all your data points; points outside this range are "outliers" and are drawn individually.
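
A minimal R sketch of such a between-slide comparison, assuming M is a matrix of M values with one column per slide:

  boxplot(as.data.frame(M), xlab = "slide", ylab = "M")   # one box per slide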
Variance standardisation
 Common approach for standardisation of yi’s is:
( yi – mean) / (standard deviation)
 This approach is sensitive to outliers and extreme values: maybe those are your up/down genes, and you do not want to penalise them too much!
 More robust/less affected by extreme values:
 Use median instead of mean
 Use absolute deviations instead of squared deviations (as in
variance and standard deviation)
Scale normalisation by Yang et al. (2002, Nucleic Acids Research 30(4), e15)
 Recommends normalisation of M values
 Considers within array the absolute deviation of each
observation from median:
 | yi – median |
 Determines the “scale” (MAD) as the median value
of all | yi – median |
 You can standardise to an average scale, computed
as geometric mean of MAD values
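
A minimal R sketch of this scale normalisation, assuming M is a matrix of (already lowess-normalised) M values with one column per slide; the MAD is computed directly here rather than with R's mad(), which adds a scaling constant:

  mad.slide <- apply(M, 2, function(m) median(abs(m - median(m))))   # MAD per slide
  C <- exp(mean(log(mad.slide)))                                     # geometric mean of the MADs
  scale.factor <- mad.slide / C
  M.scaled <- sweep(M, 2, scale.factor, "/")                         # divide each slide by its scale factor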
Illustration scale normalisation by Yang et
al.
Plots by Dan Nettleton, Iowa State Univ.
A Simple Example

Gene   Slide1Cy3   Slide1Cy5   Slide2Cy3   Slide2Cy5
1          8          15           9          13
2          7           2           7          15
3          3           6           5           8
4          1           5           2           9
5          9          13           6          11

Slide by Dan Nettleton, Iowa State Univ.
Determine Channel Medians

Gene      Sl1Cy3   Sl1Cy5   Sl2Cy3   Sl2Cy5
1            8       15        9       13
2            7        2        7       15
3            3        6        5        8
4            1        5        2        9
5            9       13        6       11
medians      7        6        6       11

Slide by Dan Nettleton, Iowa State Univ.
Subtract Channel Medians

Gene   Slide1Cy3   Slide1Cy5   Slide2Cy3   Slide2Cy5
1          1           9           3           2
2          0          -4           1           4
3         -4           0          -1          -3
4         -6          -1          -4          -2
5          2           7           0           0

This is the data after median centering.

Slide by Dan Nettleton, Iowa State Univ.
Find Median Absolute Deviations

(median-centered data as in the previous table)

MAD:   Slide1Cy3: 2   Slide1Cy5: 4   Slide2Cy3: 1   Slide2Cy5: 2

Slide by Dan Nettleton, Iowa State Univ.
Find Scaling Constant

(median-centered data as above, with MAD values 2, 4, 1, 2)

C = (2 * 4 * 1 * 2)^(1/4) = 2   (the geometric mean of the MAD values)

Slide by Dan Nettleton, Iowa State Univ.
Find Scaling Factors

(median-centered data as above, with MAD values 2, 4, 1, 2 and C = 2)

Scaling factors = MAD / C:   Slide1Cy3: 2/2 = 1   Slide1Cy5: 4/2 = 2   Slide2Cy3: 1/2 = 0.5   Slide2Cy5: 2/2 = 1

Slide by Dan Nettleton, Iowa State Univ.
Scale Normalize the Median Centered Data

Gene   Slide1Cy3   Slide1Cy5   Slide2Cy3   Slide2Cy5
1          1          4.5          6           2
2          0         -2.0          2           4
3         -4          0.0         -2          -3
4         -6         -0.5         -8          -2
5          2          3.5          0           0

This is the data after median centering and scale normalizing.

Slide by Dan Nettleton, Iowa State Univ.
Quantile Normalisation (Bolstad et al. 2003, Bioinformatics 19(2):185-193)
 Commonly applied to Affymetrix array data
 Forces the distribution between slides to be exactly the same by:
 Substituting the biggest value on each slide by the average of all the biggest values
 Similarly for the second-biggest values …
 … the third-biggest values
 …
 … down to the smallest values on each slide
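
A minimal R sketch of this substitution, assuming X is a matrix of intensities with one column per slide and no missing values (packages such as limma or preprocessCore provide ready-made implementations):

  ranks  <- apply(X, 2, rank, ties.method = "first")    # rank of each value within its slide
  target <- rowMeans(apply(X, 2, sort))                 # average of the k-th smallest values across slides
  X.qn   <- apply(ranks, 2, function(r) target[r])      # substitute each value by the target for its rank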
Illustration quantile
normalisation
Plots by Dan Nettleton, Iowa State
Univ.
Remarks quantile
normalisation
 Quantile normalisation gives statistical difficulties
because observations on different slides are no longer
independent; each “observation” is now a mean from
multiple slides.
 It looks a bit like non-parametric methods based on
ranks of observations (within slide) – but then I’d
prefer such a direct rank-based analysis instead of
this semi- non-parametric quantile normalisation.
2. Accounting for variance
heterogeneity between
genes using t-test
T-test (or Student’s test)
 T-test computes a standardised effect accounting for
number of observations (signal/noise):
T-test = (Effect / St. dev.) * √n
 If n bigger or st.dev smaller: t-test larger
 Significant for > 2 to 3 or < -2 to –3
(depends also on nr. used for computing st.dev.)
 Can be further transformed in a p-value
Example t-test

Gene   Avg Effect (on n=4)   Std. Dev (on n=8)   T-test   P-value
A            +2.00                 1.8            2.22     0.045
B            +2.00                 1.3            3.08     0.011
• The p-values can be computed for instance in Excel (TDIST) or in R using pt()
• Usually "two-sided" testing is used because you do not know a priori the sign of a gene effect, so for a 0.05 significance level you apply 0.025 in each tail.
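
A minimal R sketch of such a per-gene t-test, assuming M is a matrix of M values with genes in rows and replicate slides in columns:

  t.stat <- apply(M, 1, function(m) mean(m) / (sd(m) / sqrt(length(m))))
  p.val  <- 2 * pt(-abs(t.stat), df = ncol(M) - 1)       # two-sided p-values
  # equivalently: apply(M, 1, function(m) t.test(m)$p.value)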
3. Multiple testing
problems; False Discovery
Rate
Statistical errors (1)
 Statistical errors are inevitable: if you want a zero risk
test, never accept an alternative (skip the
experiment)
 But then you make "false negative" errors …
 "Null hypothesis": starting point
 e.g. "no effect of gene"
 Test for "gene effect":
 If there is really no effect but the test says there is: false positive
 If there is an effect but the test does not detect it: false negative
Statistical errors (2)
 Statistical significance (or α) level is the fraction of false
positives, can be determined by experimenter.
 Usually 1-5%: you have to risk some to gain some, and “usually”
1-5% false positives is found OK
 The level of false negatives can vary and may be high (depends on
power of experiment: poor power means many true effects remain
undetected = false neg.)
 For microarray analysis 1-5% false positives is still a lot: e.g. 100 out of 10,000 are "false" with α=0.01.
Accounting for false positives:
Example with 10,000 genes tested
                                α=1%      α=0.1%
Total genes tested             10,000     10,000
Expected false pos.               100         10
Actual # significant              150         50
Excess (probable real pos.)        50         40
"False Discovery Rate"            67%        20%
Histogram of p-values
showing excess of positives
[Histogram of p-values in 5% bins, with an excess of positives in the lowest bin. If there is nothing, you expect 5% of the tests to end up in each bin – and therefore also 5% of p-values below 0.05]
What significance level to choose?
 This depends on what you want to do with
the results!
 If you will do validation by RT-PCR, further experiments, and more
validation studies, you might be happy with up to 50% false ones in
your set.
 If follow-up is very costly (drug testing, …), or when filing a patent
application you probably want more scrutiny, e.g. <10% false ones
in your set.
 Power of experiment also puts limits:
 Poor power e.g. 10 genes with only 1 false pos.
 Good power e.g. 50-100 genes with only 1-2 false pos.
Approaches to determine FDR
 Many approaches work on the histogram of
p-values to determine excess. A way to
formalise this is:
 Order the n p-values
 Compute for the i'th p-value i*α/n
 Set the cut-off just before i*α/n > i'th p-value
 You can compute 'q' values as an aid to select using FDR (see the sketch below)
 Based on permutation of the data, such as in SAM
 This depends on assumption that most genes are neutral
 We will see more of this in the practicals….
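
A minimal R sketch of the ordered-p-value cut-off described above, assuming p is a vector of per-gene p-values and alpha the desired FDR level:

  alpha <- 0.05
  n <- length(p)
  p.sorted <- sort(p)
  ok <- which(p.sorted <= (1:n) * alpha / n)
  cutoff <- if (length(ok) > 0) p.sorted[max(ok)] else 0   # declare genes with p <= cutoff significant
  # the built-in equivalent gives FDR-adjusted (q-like) values: p.adjust(p, method = "BH")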
4. Something on clustering
Clustering
 Groups your data
 For microarrays usually applied to group genes, e.g.
 Genes with similar profile in time
 Genes with similar up/down regulation in condition x
 Can not replace statistical analysis
 Is usually applied after statistical analysis to cluster only genes
which have an effect
 Including all data would be too much “noise”
 There are many methods; some require you to specify a priori the number of groups you want.
Microarray Data for Clustering

Rows are objects (genes), columns are attributes (time points); the entries are estimated expression levels:

object   attribute 1    2     3   ...   m
1            4.7       3.8   5.9  ...  1.3
2            5.2       6.9   3.8  ...  2.9
3            5.8       4.2   3.9  ...  4.4
...
n            6.3       1.6   4.7  ...  2.0

Slide by Dan Nettleton, Iowa State Univ.
Illustration:
3 clusters for time profiles of top
400 genes
Plots by Dan Nettleton, Iowa State
Univ.
How do we cluster (1) ?

object   attribute 1    2     3   ...   m
1            4.7       3.8   5.9  ...  1.3
2            5.2       6.9   3.8  ...  2.9
3            5.8       4.2   3.9  ...  4.4

Somehow we have to compute whether two rows look "similar": use e.g. correlation or "distance".
How do we cluster (2) ?
 Correlation based:
 Groups on similarity irrespective of means, for instance
groups constant high & constant low profiles
 PCA is also correlation based
 Distance based: the actual data has to be close
 Will not group a constant high & a constant low profile
Two main approaches for
clustering
 “K-means”
 Groups records in predefined # of groups iteratively so that
distances within groups are smallest
 Starts randomly with 1 record per group, and then adds
following records
 Hierarchical (bottom-up)
 Groups two “closest” records in one group
 Repeats to find next closest records/groups
 The groups get bigger, which allows making a dendrogram
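
A minimal R sketch of both approaches, assuming 'top' is a matrix with the selected top genes in rows and time points or conditions in columns:

  km <- kmeans(top, centers = 3)              # K-means type clustering into a predefined number of groups
  hc <- hclust(dist(top))                     # hierarchical, bottom-up clustering on Euclidean distances
  plot(hc); groups <- cutree(hc, k = 3)       # dendrogram, then cut it into 3 groups
  # a correlation-based alternative: cluster on 1 - correlation between gene profiles
  hc.cor <- hclust(as.dist(1 - cor(t(top))))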
Example of K Medoids Clustering
Slide by Dan Nettleton, Iowa State Univ.
Start with K Medoids
Slide by Dan Nettleton, Iowa State Univ.
Assign Each Point to Closest Medoid (1)
Slide by Dan Nettleton, Iowa State Univ.
Assign Each Point to Closest Medoid (2)
Slide by Dan Nettleton, Iowa State Univ.
Assign Each Point to Closest Medoid (3)
Slide by Dan Nettleton, Iowa State Univ.
Find New Medoid for Each Cluster
New medoids have smallest average
distance to other points in their cluster.
Slide by Dan Nettleton, Iowa State Univ.
Reassign Each Point to Closest Medoid
Slide by Dan Nettleton, Iowa State Univ.
Reassign Each Point to Closest Medoid
Slide by Dan Nettleton, Iowa State Univ.
Find New Medoid for Each Cluster
Slide by Dan Nettleton, Iowa State Univ.
Reassign Each Point to Closest Medoid
No reassignment is needed,
so the procedure stops.
Slide by Dan Nettleton, Iowa State Univ.
Illustration of hierarchical clustering
[Dendrogram; the marked heights (e.g. 9.7 and 14) are the distances at which clusters are merged]
Plot by Dan Nettleton, Iowa State Univ.
Hierarchical clustering
[Dendrogram of 5 observations, showing where to cut for 2, 3 or 4 clusters]
Plot by Dan Nettleton, Iowa State Univ.
3rd half day
Outline
 Linear models to analyse designs with multiple
treatments or times
 Further refinement of handling heterogeneity of
variances using “Bayesian smoothing”
How many comparisons?
In a 2x2 design with day and treatment
 Day5 vs Day7 (within control)
 Day5 vs Day7 (within Atten)
 Contr vs Atten (within day5)
 Contr vs Atten (within day7)
 ... but one of these is redundant (e.g. the 4th can be derived from the other three).
Comparison made in linear
model
 For the 2x2 design the linear model is commonly used
to yield:
 An overall day effect (“main effect”): the differences between
day5 and day7 pooled over treatments
 An overall treatment “main” effect (similarly pooled over days)
 An interaction, to specify ONE deviation from main effects, e.g.
for treatment ATTEN within day7
1. Linear models:
estimability, setting up
effects
Basics of linear models
 By linear models we describe an observation as a
sum of “constant” effects:
 Mi = mean + “day 3 effect” + “challenge effect”
 By “constant” effect is meant that there is a constant
“challenge effect” computed irrespective of days (unless
specified otherwise)
Example linear model

             Day 3   Day 5   Day 7   Day 11
Control        x       x       x       x
Challenged     x       x       x       x

The difference between these two rows is the overall effect of challenged vs. control, assumed to be the same at all days.
Linear model, more formally:

             Day 3   Day 5    Day 7    Day 11
Control      µ       µ+d5     µ+d7     µ+d11
Challenged   µ+c     µ+d5+c   µ+d7+c   µ+d11+c

Difference between the two rows: c. Difference between the Day 3 and Day 5 columns: d5.
Linear model
 The effects so described need not match the actual data: this is the model that describes the data in the "best" way under the assumptions; there will be some remaining error
 We do not have a day 3 effect or control effect: the
“day 3 - control”-cell is the reference; the model mean
refers to this reference cell.
 We can take another reference base, and therefore
another model mean.
Same ("equivalent") model but with reference cell day7-challenged

             Day 3      Day 5      Day 7   Day 11
Control      µ+ct+d3    µ+ct+d5    µ+ct    µ+ct+d11
Challenged   µ+d3       µ+d5       µ       µ+d11

What is now estimated as the "control effect" (ct) is the same as minus the "challenged effect" in the previous model; in this way all estimates can be transformed between the two equivalent models.
General rules for estimable
effects
 We have to select arbitrarily a reference cell which
gets assigned the model mean
 Most statistical software does this automatically, e.g. choosing the cell with the first level of every factor
 For every treatment there will be (# levels – 1) additional effects to be estimated: 3 effects for 4 days, 1 effect for control vs. challenged.
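
A minimal R sketch of how this looks in practice, using a small made-up 2 x 4 design; model.matrix() shows which effects are actually set up (here day 3 and Control form the reference cell absorbed into the mean):

  d <- data.frame(day = factor(c(3, 3, 5, 5, 7, 7, 11, 11), levels = c(3, 5, 7, 11)),
                  trt = factor(rep(c("Control", "Challenged"), 4),
                               levels = c("Control", "Challenged")))
  model.matrix(~ day + trt, data = d)   # columns: intercept, day5, day7, day11, trtChallenged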
One extension: interaction.
 Sometimes the “constant” effects model is not
satisfactory, and we can also include an interaction
(this remains a “linear model”)
             Day 3   Day 5
Control      µ       µ+d
Challenged   µ+c     µ+c+d+cd

The extra term cd says that something extra is happening on day 5 in the challenged animals.
Setting up linear model fit
 Designs may have multiple factors, e.g. 2 lines, 2
treatments, 3 days
 A typical lay-out may then look like:
           On Cy3        On Cy5
Slide 1    L1, T1, D1    L1, T1, D3
Slide 2    L1, T2, D1    L1, T2, D3
etc...     ...           ...
 But this is not the way to put it in a linear model....
Linear model set-up
 The “factors” and their levels are for instance:
 Line: 1,2. Treatment: 1,2, Day: 3,5,7
 And a more typical data lay-out for linear models will
look like:
            Line   Treatment   Day
Observ 1?    1        1         3
Observ 2?    1        2         5
etc...       2        1         7
Linear model set-up
 Complication: the “observations” mostly used are
the “M” values, which do not allow simple linear
model set-up
 In R some special functions are supplied to do so
 Alternative: "Wolfinger" model discussed later.
Set-up for reference design
 There is only one “real” experimental measurement
per slide
 Set-up the linear model table with the combination of levels
for that one experimental measurement
 The other measurement is the reference
 Add an extra column which only indicates whether reference
was on cy3 or on cy5
 Special supplied R function makes the final linear
model set-up
 des <- refdesign(~factor(line)+factor(day),ref)
A look at final set-up for
reference design
 It can be instructive to have a look at the final
design made by the refdesign() function
 This design contains -1's and +1's depending on the
factors available and whether the slide was
"swapped" or not, inverting the contrasts
 The exact set-up of the contrasts determines for
instance whether a treatment effect is (T1-T2) or
(T2-T1)
Set-up for direct comparisons
 Here two “real” experimental measurements are
made, we have two different sets of factors for cy3
and cy5 measurement.
 The following set-up is most easy, filling a line for
each of cy3 and cy5:
Slide   Dye   Line   Treatment   Day
1        3     1         1        3
1        5     2         1        5
2        3     1         2        3
2        5     2         2        5
etc...
Alternative model: Wolfinger model
 In this model the data set-up made for direct
comparisons (with a “dye” column) is used directly,
adding “dye” to the model
 The analysed data are not M-values but the log intensity
values
 Can not (simply) be applied to a reference design but is
a convenient model for direct comparison designs
"Wolfinger" model
 Advantage:
 The data doubles, there are two intensity values instead of
one difference (M value)
 Can apply linear models directly
 Can analyse “dye effect” to detect bad dye swaps
 Disadvantage:
 Assumes good repeatability of level of expressions across
slides. Analysis of M-values only assumes good repeatability
of ratio of expressions across slides.
Detection of “dye effect”:
 Sometimes it happens that red spots are also red in a
dye swap (but they should have turned to green) and
vice versa -> “dye effect”
 You would fit the model (per gene)
logExpr ~ mu + dye + day + treatm
 If everything goes well, dye should not have an effect, effects
need to be taken up by day and treatment
 If dye effects do become significant (too often), something’s
wrong….
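
A minimal R sketch of this per-gene fit, assuming a data frame 'dat' for one gene with columns logExpr, dye, day and treatm (all but logExpr set up as factors):

  fit <- lm(logExpr ~ dye + day + treatm, data = dat)
  summary(fit)    # a dye effect that is significant too often points at problems with the dye swaps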
A final note on linear models
 From linear models we can obtain several contrasts,
e.g. differences between 2 days, between 2 lines, and
even an assessment of genes that particularly differ
between days within 1 line only
 Obviously, every such contrast relates to a different
top-list of genes!
 So we obtain many gene-sets of genes doing
something between days, between lines, or between
days only within one line, etc.
2. Another look at
variance heterogeneity:
Bayesian variance
smoothing
More problems of variance
heterogeneity
 Variance estimates (per gene) are based on small
numbers, therefore can accidentally happen to be (very)
small or happen to be (very) large
 Accidental low variance -> high t-test
 Accidental high variance -> low t-test
 We can apply the philosophy of “random effects” to
variances: we do not believe too much in extreme
values, certainly not when based on small numbers, and
like to pull them to a common mean
Variance smoothing
 The function eBayes from the Limma / Bioconductor
package does this variance smoothing
 The effect is especially that accidental small variances
are pulled up
 Extreme t-tests are therefore tempered
 The pulling towards the average is stronger when the
variance is computed on a small number
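
A minimal sketch of this in R, assuming the normalised log expression data are in a matrix 'y' (genes in rows) and 'design' is the design matrix of the experiment; lmFit(), eBayes() and topTable() are functions of the limma package:

  library(limma)
  fit <- lmFit(y, design)      # per-gene linear model fits
  fit <- eBayes(fit)           # Bayesian smoothing of the gene variances
  topTable(fit, coef = 2)      # top genes for the 2nd coefficient, with moderated t-statistics and p-values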
Gene variances before and
after eBayes smoothing