:: Microarray analysis ::
•Data pre-processing
•Normalization
•Molecular diagnosis
•Statistical classification
Florian Markowetz
florian@genomics.princeton.edu
From experiment to data
Raw data are not mRNA concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription efficiency
• hybridization efficiency and specificity
• clone identification and mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array manufacturing related issues
• image segmentation
• signal quantification
• “background” correction
Quality control: noise and reliable signal
[Figure: arrays 1 ... n, with probe level, array level and gene level marked]
Probe level: quality of the expression measurement of one spot on one particular array
Array level: quality of the expression measurement on one particular glass slide
Gene level: quality of the expression measurement of one probe across all arrays
Probe-level quality control
• Individual spots printed on the slide
• Sources:
– faulty printing, uneven distribution, contamination with debris, magnitude of signal relative to noise, poorly measured spots;
• Visual inspection:
– hairs, dust, scratches, air bubbles, dark regions, regions with haze
• Spot quality:
– Brightness: foreground/background ratio
– Uniformity: variation in pixel intensities and ratios of intensities within a spot
– Morphology: area, perimeter, circularity.
– Spot Size: number of foreground pixels
• Action:
– set measurements to NA (missing values)
– local normalization procedures which account for regional idiosyncrasies.
– use weights for measurements to indicate reliability in later analysis.
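The first action (set unreliable measurements to NA) can be sketched as follows. This is a minimal sketch, not the procedure of any particular image analysis package: the function name `flag_spots`, the example intensities and the ratio cutoff of 1.5 are all illustrative assumptions, and `None` stands in for NA.

```python
def flag_spots(foreground, background, min_ratio=1.5):
    """Return background-corrected spot signals, with None (NA) where the
    foreground/background ratio is below min_ratio.

    The cutoff of 1.5 is an illustrative assumption, not a recommended value.
    """
    flagged = []
    for fg, bg in zip(foreground, background):
        # A non-positive background is treated as a failed measurement.
        if bg <= 0 or fg / bg < min_ratio:
            flagged.append(None)
        else:
            flagged.append(fg - bg)  # background-corrected signal
    return flagged


# Hypothetical intensities for four spots.
fg = [1200.0, 310.0, 5000.0, 95.0]
bg = [300.0, 290.0, 250.0, 100.0]
signals = flag_spots(fg, bg)
```

Instead of discarding spots, the same ratio could be kept as a weight for later analysis, as the last bullet suggests.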
Spot identification
Individual spots are recognized; size and shape may be adjusted per spot (automatically, with fine adjustments by hand).
Additional manual flagging of bad (X) or non-present (NA) spots.
[Figure: example spots flagged NA and X, illustrating poor and good spot quality]
Different spot identification methods: fixed circles, circles with variable size, arbitrary spot shape (morphological opening).
Spot identification
• The signal of the spots is quantified.
[Figure: histogram of pixel intensities of a single spot]
• “Donuts”
• Summary statistics: mean / median / mode / 75% quantile
• Local background
• Image analysis software: GenePix, QuantArray, ScanAlyze
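The summary statistics named above can be computed from the pixel intensities of one spot as in the sketch below. It uses only the Python standard library; the function name `summarize_spot`, the pixel values and the background level are hypothetical, and real image analysis software may define foreground and background differently.

```python
import statistics


def summarize_spot(pixels, local_background):
    """Summarize the foreground pixel intensities of one spot and subtract
    a local background estimate from each summary value."""
    ordered = sorted(pixels)
    # statistics.quantiles with n=4 returns the three quartiles;
    # the last one is the 75% quantile.
    q75 = statistics.quantiles(ordered, n=4)[2]
    summaries = {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "mode": statistics.mode(ordered),
        "q75": q75,
    }
    return {name: value - local_background for name, value in summaries.items()}


# Hypothetical pixel intensities: one very bright pixel pulls the mean up.
spot_pixels = [100, 120, 120, 130, 150, 400]
result = summarize_spot(spot_pixels, local_background=50)
```

In this example the single bright pixel raises the mean well above the median, which is one reason to compare several summaries.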
Array-level quality control
• Problems:
– array fabrication defect
– problem with RNA extraction
– failed labeling reaction
– poor hybridization conditions
– faulty scanner
• Quality measures:
– Percentage of spots with no signal (~30% excluded spots)
– Range of intensities
– (Av. Foreground)/(Av. Background) > 3 in both channels
– Distribution of spot signal area
– Amount of adjustment needed: signals have to be changed substantially to make slides comparable.
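Two of the quality measures above can be checked in a few lines. This is a sketch under assumptions: the function name `array_report` and the example intensities are hypothetical, and using 30% as a cutoff on the fraction of missing spots is one possible reading of the "~30% excluded spots" note, not a rule stated in the slides.

```python
from statistics import mean


def array_report(red_fg, red_bg, green_fg, green_bg, signals,
                 min_ratio=3.0, max_missing=0.30):
    """Array-level checks: average foreground / average background in both
    channels, and the fraction of spots without a usable signal (None)."""
    ratio_red = mean(red_fg) / mean(red_bg)
    ratio_green = mean(green_fg) / mean(green_bg)
    missing = sum(s is None for s in signals) / len(signals)
    passed = (ratio_red > min_ratio and ratio_green > min_ratio
              and missing <= max_missing)  # max_missing is an assumed cutoff
    return {
        "ratio_red": ratio_red,
        "ratio_green": ratio_green,
        "missing_fraction": missing,
        "pass": passed,
    }


# Hypothetical intensities for one slide.
report = array_report(
    red_fg=[900, 1200, 1500], red_bg=[300, 300, 300],
    green_fg=[800, 1000, 1200], green_bg=[250, 250, 250],
    signals=[1.2, None, 0.4, 0.9],
)
```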
Gene-level quality control
[Figure: gene g]
• Poor hybridization in the reference channel may introduce bias in the fold change
• Some probes will not hybridize well to the target RNA
• Printing problems: all spots of a given inventory well may have poor quality
• A well may be of bad quality (contamination)
• Genes with a consistently low signal in the reference channel are suspicious
Gene expression data
Rows are genes, columns are mRNA samples:

Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...

Entry (i, j): gene-expression level or ratio for gene i in mRNA sample j
M = log2(red intensity / green intensity), or a function (PM, MM) of MAS, dChip or RMA
A = average of log2(red intensity) and log2(green intensity), or a function (PM, MM) of MAS, dChip or RMA
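For two-colour arrays, M and A follow directly from the red and green intensities. A minimal sketch with hypothetical intensities (the function name `ma_values` is an assumption for illustration):

```python
import math


def ma_values(red, green):
    """Compute M = log2(R / G) and A = 1/2 * log2(R * G) for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


# Hypothetical background-corrected intensities for three spots.
red = [800.0, 200.0, 1600.0]
green = [400.0, 200.0, 400.0]
M, A = ma_values(red, green)
```

M = 1 means a two-fold higher red signal, M = 0 means no difference; A is the average log intensity of the spot.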
Scatterplot
[Figure: scatterplots of the same data on the raw scale and on the log scale]
Message: look at your data on log-scale!
MA Plot
M = log2(R/G)
A = 1/2 log2(RG)
Median centering
One of the simplest strategies is to bring the “centers” of all arrays to the same level.
Assumption: the majority of genes are unchanged between conditions.
Divide all expression measurements of each array by that array's median. On the log scale this means subtracting the median, so the log signal is centered at 0.
The median is more robust to outliers than the mean.
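On log-scale data, median centering is a one-line subtraction per array. A minimal sketch with hypothetical log2 signals (the function name `median_center` is an assumption):

```python
from statistics import median


def median_center(log_signals):
    """Subtract the array median from every log-scale measurement,
    so that the centered array has median 0."""
    center = median(log_signals)
    return [x - center for x in log_signals]


# Hypothetical log2 signals of one array, including one outlier (14.0).
array1 = [7.0, 8.0, 9.0, 14.0]
centered = median_center(array1)
```

The outlier shifts the mean of this array to 9.5 but leaves the median at 8.5, which is why the median is the more robust center.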
Problem of median-centering
Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
[Figure: scatterplot of log signals after median-centering; axes: Log Green and Log Red]
[Figure: M-A plot of the same data; axes: A = (Log Green + Log Red) / 2 and M = Log Red - Log Green]
Lowess normalization
[Figure: M-A plot with a local estimate of M as a function of A = (Log Green + Log Red) / 2]
Use the estimate to bend the banana straight: subtract the local estimate from M at each A.
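The idea can be sketched in plain Python. Note that this is a simplified stand-in, not full lowess: it performs a single tricube-weighted local linear fit without the robustness iterations of the real algorithm. The function names, the `frac` parameter and the banana-shaped example data are assumptions for illustration. In practice one would use an existing implementation, for example from Bioconductor.

```python
def local_linear_fit(a, m, frac=0.5):
    """Tricube-weighted local linear fit of m on a, evaluated at each a[i].

    Simplified core of lowess: no robustness iterations.
    """
    n = len(a)
    k = max(2, int(round(frac * n)))  # number of neighbours per local fit
    fitted = []
    for x0 in a:
        dists = sorted(abs(x - x0) for x in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in a]
        sw = sum(w)
        mean_a = sum(wi * x for wi, x in zip(w, a)) / sw
        mean_m = sum(wi * y for wi, y in zip(w, m)) / sw
        sxx = sum(wi * (x - mean_a) ** 2 for wi, x in zip(w, a))
        sxy = sum(wi * (x - mean_a) * (y - mean_m)
                  for wi, x, y in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (x0 - mean_a))
    return fitted


def lowess_normalize(a, m, frac=0.5):
    """Subtract the local estimate of M at each A from the observed M."""
    fit = local_linear_fit(a, m, frac)
    return [mi - fi for mi, fi in zip(m, fit)]


# Hypothetical banana-shaped M-A data: M depends on the intensity A.
A = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
M = [0.2 * (x - 9.5) ** 2 - 1.0 for x in A]
M_normalized = lowess_normalize(A, M)
```

After normalization the intensity-dependent trend in M is largely removed, while a global method such as median centering would only shift the whole banana up or down.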
Summary I
• Raw data are not mRNA concentrations
• We need to check data quality on different levels
– Probe level
– Array level (all probes on one array)
– Gene level (one gene on many arrays)
• Always log-transform your data
• Normalize your data to avoid systematic (non-biological) effects
• Lowess normalization straightens the banana
From data to knowledge
OK, now we have made sure that our data are of high quality and that systematic, non-biological effects have been removed.
The result is a gene expression matrix (rows are genes, columns are mRNA samples):

Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...

Is that already a result? No! It’s just data, not knowledge.
We need to use this data to answer a scientific question.
Supervised analysis
= learning from examples, classification
– We have already seen groups of healthy and sick people. Now let’s diagnose the next person walking into the hospital.
– We know that these genes have function X (and these others don’t). Let’s find more genes with function X.
– We know many gene pairs that are functionally related (and many more that are not). Let’s extend the number of known related gene pairs.
Known structure in the data needs to be generalized to new data.
Un-supervised analysis
= clustering
– Are there groups of genes that behave similarly in all conditions?
– Disease X is very heterogeneous. Can we identify more specific sub-classes for more targeted treatment?
No structure is known. We first need to find it. Exploratory analysis.
Supervised analysis
[Cartoon]
“Calvin, I still don’t know the difference between cats and dogs …”
“Don’t worry! I’ll show you once more: Class 1: cats. Class 2: dogs.”
“Oh, now I get it!!”
Un-supervised analysis
[Cartoon]
“Calvin, I still don’t know the difference between cats and dogs …”
“I don’t know it either. Let’s try to figure it out together …”
Supervised analysis: setup
• Training set
– Data: microarrays
– Labels: for each one we know if it falls into our class of interest or not (binary classification)
• New data (test data)
– Data for which we don’t have labels.
– E.g. genes without known function
• Goal: generalization ability
– Build a classifier from the training data that is good at predicting the right class for the new data.
One microarray, one dot
Think of a space with #genes dimensions (yes, it’s hard for more than 3). Each microarray corresponds to a point in this space.
If gene expression is similar under some conditions, the points will be close to each other.
If gene expression overall is very different, the points will be far away.
[Figure: microarrays as points; axes: expression of gene 1 and expression of gene 2]
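"Close" and "far away" can be made concrete with a distance between arrays. The sketch below uses the Euclidean distance, which is one common choice and an assumption here (the slides do not name a specific distance). The three profiles are the first three sample columns of the expression matrix shown earlier, restricted to genes 1 to 5.

```python
import math


def euclidean_distance(array_x, array_y):
    """Distance between two microarrays, each seen as a point with one
    coordinate per gene."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(array_x, array_y)))


# Expression of genes 1..5 in three samples (columns of the matrix).
sample1 = [0.46, -0.10, 0.15, -0.45, -0.06]
sample2 = [0.30, 0.49, 0.74, -1.03, 1.06]
sample3 = [0.80, 0.24, 0.04, -0.79, 1.35]

d12 = euclidean_distance(sample1, sample2)
d23 = euclidean_distance(sample2, sample3)
```

On these five genes, sample2 lies closer to sample3 than to sample1.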
Which line separates best?
[Figure: candidate separating lines A, B, C, D]
No sharp knife, but a …
Support Vector Machines
Maximal margin separating hyperplane
Data points closest to the separating hyperplane = support vectors
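The notions of margin and support vectors can be illustrated without training an SVM. The sketch below does not implement the SVM optimisation; it only compares a few hand-picked candidate lines by their margin. The points, labels and candidate lines are hypothetical.

```python
import math


def signed_distances(w, b, points, labels):
    """Signed distance of each point to the line w.x + b = 0; positive if
    the point lies on the side of its own class (labels are +1 / -1)."""
    norm = math.hypot(*w)
    return [y * (w[0] * x[0] + w[1] * x[1] + b) / norm
            for x, y in zip(points, labels)]


def margin(w, b, points, labels):
    """Margin of a line: the smallest signed distance over all points
    (negative if some point is misclassified)."""
    return min(signed_distances(w, b, points, labels))


def support_vectors(w, b, points, labels, tol=1e-9):
    """Points closest to the line, i.e. those that attain the margin."""
    dists = signed_distances(w, b, points, labels)
    smallest = min(dists)
    return [x for x, d in zip(points, dists) if abs(d - smallest) < tol]


# Hypothetical two-gene data: three arrays per class.
points = [(3, 3), (4, 3), (4, 4), (1, 1), (0, 1), (1, 0)]
labels = [+1, +1, +1, -1, -1, -1]

# Hypothetical candidate lines, given as (w, b) for w.x + b = 0.
candidates = {
    "A": ((1, 1), -4.0),    # x + y = 4
    "B": ((1, 0), -2.0),    # x = 2
    "C": ((0, 1), -1.5),    # y = 1.5
}
margins = {name: margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
best_support = support_vectors(*candidates[best], points, labels)
```

All three candidates separate the classes, but they differ in how much room they leave to the nearest points. A support vector machine searches over all lines for the one with the largest margin, rather than choosing among a handful of candidates.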
How well did we do?
Training error: how well do we do on the data we trained the classifier on?
But how well will we do in the future, on new data?
Test error: how well does the classifier generalize?
[Figure: same classifier (= line), new data from the same classes]
The classifier will usually perform worse than before: test error > training error
Cross-validation
Train the classifier and test it: the training part of the data gives the training error, the held-out test part gives the test error.
K-fold cross-validation, here for K=3:
Step 1. Train | Train | Test
Step 2. Train | Test | Train
Step 3. Test | Train | Train
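The three steps above can be sketched as a loop over folds. To keep the example self-contained, the classifier is a simple nearest-centroid rule rather than a support vector machine; that substitution, the fold assignment by index and the toy data are all assumptions for illustration.

```python
def centroid(points):
    """Mean profile of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def train(points, labels):
    """Nearest-centroid classifier: store one mean profile per class."""
    return {c: centroid([p for p, y in zip(points, labels) if y == c])
            for c in set(labels)}


def predict(model, point):
    """Assign the class whose centroid is closest to the point."""
    return min(model, key=lambda c: sq_dist(model[c], point))


def error_rate(model, points, labels):
    wrong = sum(predict(model, p) != y for p, y in zip(points, labels))
    return wrong / len(points)


def cross_validate(points, labels, k=3):
    """K-fold cross-validation: each fold is used once as test set while
    the classifier is trained on the remaining folds."""
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(points)) if i % k == fold]
        train_idx = [i for i in range(len(points)) if i % k != fold]
        model = train([points[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        fold_errors.append(error_rate(model,
                                      [points[i] for i in test_idx],
                                      [labels[i] for i in test_idx]))
    return sum(fold_errors) / k


# Hypothetical two-gene profiles, classes interleaved so that every
# training set contains both classes.
points = [(0.1, 0.2), (1.1, 0.9), (0.3, 0.1), (0.9, 1.2), (0.2, 0.4), (1.0, 1.0)]
labels = ["healthy", "sick", "healthy", "sick", "healthy", "sick"]

training_error = error_rate(train(points, labels), points, labels)
cv_error = cross_validate(points, labels, k=3)
```

The toy classes are well separated, so both error estimates come out as zero here; on real data the cross-validated error is the one to report, because the training error is computed on data the classifier has already seen.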
Summary II
• Supervised and un-supervised learning … are needed everywhere in biology and medicine
• Microarrays = points in high-dimensional spaces
• Classifiers = lines (hyperplanes) in these spaces
• Support Vector Machines use maximal margin hyperplanes as classifiers
• Classifier performance: Test error > training error
• Cross-validation is the right way to evaluate
Experimental Cycle
• Biological question (hypothesis-driven or explorative)
• Experimental design
• Microarray experiment
• Image analysis
• Quality measurement (failed / pass)
• Pre-processing: normalization
• Analysis: estimation, testing, clustering, discrimination
• Biological verification and interpretation

“To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.”
Ronald Fisher
Books
• Terry Speed, “Statistical Analysis of Gene Expression Microarray Data”, Chapman & Hall/CRC
• David W. Mount, “Bioinformatics”, Cold Spring Harbor
• Giovanni Parmigiani et al., “The Analysis of Gene Expression Data”, Springer
• Pierre Baldi & G. Wesley Hatfield, “DNA Microarrays and Gene Expression”, Cambridge
• Gentleman, Carey, Huber, “Bioinformatics and Computational Biology Solutions Using R and Bioconductor”, Springer
And how do I analyze my own data?
www.r-project.org
www.bioconductor.org
•Open source
•Free
•Easy installation
•Helpful community
•High quality standards
•Regularly maintained and updated
•Tons of documentation
•Every package comes with example
vignettes to walk you through standard
Acknowledgements
• I ‘borrowed’ slides from: Tim Beissbarth, Achim Tresch, Wolfgang Huber, Ulrich Mansmann, Terry Speed, Jean Yang, Benedikt Brors, Anja von Heydebreck, Rainer König
• More info on microarray analysis, lectures, tutorials: http://compdiag.molgen.mpg.de/ngfn/