L3_IMAG-edited.ppt

Lecture 3
From Images to Data
A lot of this is not so relevant now,
but I think it's good basic knowledge.
Recap of: How Microarrays are used to
measure gene expression
• Basic idea: measure the activity level of a gene (its expression level),
in a particular cell at a particular time, by measuring the concentration
of that gene’s mRNA transcript in the cell’s total RNA.
• Immobilize DNA probes (oligos, cDNA) onto glass
• Hybridize labelled target mRNA (in reality cDNA equivalent) with
probes on glass,
• Measure how much binds to each probe (i.e. forms ds DNA). (In
two-channel arrays, equal amounts of two differently labelled
target cDNAs are hybridized to the probes.)
• Recall: Dyes are chosen to have well-separated spectra. Cy3 is excited at 532 nm
and Cy5 at 635 nm (emission peaks near 570 nm and 670 nm, respectively).
What is measured in Microarrays?
• We measure the amount of labeled target cDNA which is bound to the
immobilized probe by exciting the labeling molecules (the dye) with a
laser, and collecting and counting the photons emitted.
• In practice the chip is scanned (excitation, emission, and counting of emitted
photons), and the result is a digital image. This then needs to be
processed to locate the probes in the image and assign intensity
measurements to each of them. There can be from hundreds to millions
of different probes on each slide/chip.
From Slide to Image
• Uses a confocal microscope (scanner) or a charge-coupled device (CCD)
• Scanned region is divided into equally sized pixels
• Laser generates excitation light which is focused on a small portion of
the microscopic slide
• Fluorescence molecules in the area absorb the excitation photons and
emit fluorescence photons
• These fluorescence photons are WHAT we want to measure
• These are gathered on a lens
• We don’t want the excitation photons (they distort results)
• Use a dichroic beam splitter and band pass the fluorescence photons
• Detector converts the emission photons into electric current
(photomultiplier tube, PMT)
• Analog to digital converter (A/D) is used to convert the electric current into
digital pixel values (the image).
Scanner Process
[Figure: scanner pipeline. Components: laser, dye, PMT, A/D converter. Labelled steps and signals: excitation, photons, filtering, amplification, electrons, time-space averaging, signal.]
Possible Noise
• Source Noise and Detector Noise
• Source Noise:
– Excitation photons, dust on slides, treatment of the glass slide
• Detector Noise:
– Amplification, digitization
The perfect image should only reflect measures related to the
particular dye, but in practice it is common to have images that are
combined signals of photon noise, electronic noise, laser light
reflection and background fluorescence.
Adjustments to minimize noise
• One can adjust (depending on the scanner)
– Scan rate, laser power, PMT voltage
– In general, higher laser power gives more signal, but also more noise
– In general, higher PMT voltage gives more signal per photon, but may saturate pixels
Example of the TIFF image for a malaria data set:
we will look at the numbers later
From Image to numbers
• TIFF images are processed by the image analysis
program to give intensity values for each probe.
• Pixel intensities are typically from 0 to 2^16 − 1 = 65535 (a 16-bit
scale).
IMAGE ACQUISITION
– Addressing / Spot Recognition / Gridding:
locating each spot computationally
– Segmentation: classifying each pixel as foreground
(spot) or background
– Data Extraction: calculating the intensity,
background and ratio for each spot.
Images : examples
Pseudo-color overlay of the Cy3 and Cy5 images

Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed
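The colour coding in the table corresponds to the sign of the log ratio of the two channels. A minimal sketch in R, assuming that Cy5 (red) labels the perturbed sample and Cy3 (green) the control, and using a hypothetical two-fold cutoff (the slides do not specify one); the intensities are made up:

```r
# Hypothetical background-corrected intensities for four spots.
cy3 <- c(1000, 500, 4000, 800)    # control (green channel)
cy5 <- c(1000, 2000, 1000, 900)   # perturbed (red channel)

# Log ratio M = log2(perturbed / control); positive means induced.
M <- log2(cy5 / cy3)

# Assumed cutoff of one log2 unit (two-fold change); not from the slides.
cutoff <- 1
status <- ifelse(M >= cutoff, "induced (red)",
          ifelse(M <= -cutoff, "repressed (green)", "unchanged (yellow)"))

data.frame(cy3, cy5, M = round(M, 2), status)
```

Working on the log2 scale makes induction and repression symmetric around zero, which is why ratios are usually reported this way.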
Example: Addressing
This is the process of assigning
coordinates to each of the
spots.
Basic structure is KNOWN, i.e.
it is known beforehand how
many grids and how many
rows and columns in a grid
4 by 4 grids
19 by 21 spots per grid
Gridding/Addressing
• Determination of the center of each spot
• Hard, since the spots are SO small, however the fact that
the grid is known helps a lot
• Generally, a fixed grid is placed over the array and semi-manual adjustments are made (mentioned before)
• Results in partitioning the array into areas, each containing
a spot and a background.
Gridding/Addressing
• Basic structure is KNOWN, i.e. it is known beforehand
how many grids and how many rows and columns in a grid
• Addressing is matching the idealized model to the scanned
image
• Parameters include:
– Separation between rows and columns of grids
– Individual translations of grids
– Separation between rows and columns of spots within each grid
– Small individual translations of spots
– Overall position of the array in the image
Addressing
Within the same batch of
print runs, estimate the
translation of grids
4 by 4 grids
Other problems:
-- Mis-registration
-- Rotation
-- Skew in the array
Addressing: Registration
[Figures: registration examples]
Segmentation
• Segmentation methods and software that implements them:
– Fixed circle segmentation: ScanAlyze, GenePix, QuantArray
– Adaptive circle segmentation: GenePix, Dapple
– Adaptive shape segmentation: Spot (region growing and watershed)
– Histogram segmentation: ImaGene, QuantArray, DeArray and adaptive thresholding
Fixed circle segmentation
• Fits a circle with a constant diameter to all
spots in the image
• Easy to implement
• The spots need to be of the same shape and
size
Bad example !
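To make the limitation concrete, here is a minimal sketch of fixed circle segmentation in R on a synthetic pixel patch. All values (patch size, spot radius 6, circle radius 5, intensities) are made up for illustration, and `fixed_circle_mask` is a hypothetical helper, not a function from any of the packages named above.

```r
# Synthetic 21 x 21 pixel patch: background near 100, spot near 1000.
set.seed(1)
n <- 21
patch <- matrix(rnorm(n * n, mean = 100, sd = 10), n, n)
centre <- c(11, 11)   # spot centre, assumed known from gridding
true_spot <- (row(patch) - centre[1])^2 + (col(patch) - centre[2])^2 <= 6^2
patch[true_spot] <- patch[true_spot] + 900

# Fixed circle segmentation: the same radius is used for every spot.
fixed_circle_mask <- function(nr, nc, centre, radius) {
  template <- matrix(0, nr, nc)
  (row(template) - centre[1])^2 + (col(template) - centre[2])^2 <= radius^2
}
mask <- fixed_circle_mask(n, n, centre, radius = 5)

mean(patch[mask])    # foreground estimate
mean(patch[!mask])   # "background", inflated by spot pixels the circle missed
```

Because the true spot here is larger than the fixed circle, the ring of spot pixels outside the circle is counted as background, which is the kind of error the slide warns about when spots differ in size.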
Adaptive circle segmentation
• The circle diameter is estimated
separately for each spot
• Dapple finds spots by
detecting edges of spots
• Problematic if spot exhibits oval
shapes, which is often the case
Limitation of circular segmentation
—Small spot
—Not circular
Results from SRG
Adaptive shape segmentation
• Specification of starting points or seeds
• Regions grow outwards from the seed points
preferentially according to the difference between a
pixel’s value and the running mean of values in an
adjoining region.
Adaptive Shape segmentation
• Uses WATERSHED (Beucher and Meyer, 1993: Mathematical
morphology in image processing, Chapter 12) and
• Seeded Region Growing, SRG (Adams and Bischof, 1994, Seeded
Region Growing, IEEE Transactions on Pattern Analysis and Machine
Intelligence)
– Regions grow outwards from the seed points preferentially according to
the difference between a pixel’s value and the running mean of values in
an adjoining region.
• Start with a seed and grow the shape.
• Perfect for microarrays since the array is known and hence the location
of the seed is not difficult (SPOT).
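The region-growing idea can be sketched in a few lines of R. This is a deliberately simplified version of seeded region growing (4-neighbourhood, one pixel absorbed per pass, full rescan each time), not the implementation used in Spot; the function name `srg`, the image and the seed positions are all assumptions made for illustration.

```r
# Simplified seeded region growing on a small image.
srg <- function(img, seeds) {
  lab <- seeds                        # 0 = unlabelled, k > 0 = region k
  steps <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  while (any(lab == 0)) {
    # running mean of every region grown so far
    region_mean <- sapply(seq_len(max(seeds)), function(k) mean(img[lab == k]))
    best <- list(delta = Inf)
    for (i in seq_len(nrow(img))) for (j in seq_len(ncol(img))) {
      if (lab[i, j] != 0) next
      for (s in 1:4) {
        ii <- i + steps[s, 1]; jj <- j + steps[s, 2]
        if (ii < 1 || ii > nrow(img) || jj < 1 || jj > ncol(img)) next
        if (lab[ii, jj] == 0) next
        # difference between the pixel and the mean of the adjoining region
        delta <- abs(img[i, j] - region_mean[lab[ii, jj]])
        if (delta < best$delta) {
          best <- list(delta = delta, i = i, j = j, region = lab[ii, jj])
        }
      }
    }
    lab[best$i, best$j] <- best$region  # absorb the best-matching boundary pixel
  }
  lab
}

# Synthetic 9 x 9 patch with a square "spot" in the middle (made-up values).
set.seed(4)
img <- matrix(rnorm(81, 100, 5), 9, 9)
img[4:6, 4:6] <- img[4:6, 4:6] + 900
seeds <- matrix(0, 9, 9)
seeds[5, 5] <- 1   # spot seed
seeds[1, 1] <- 2   # background seed
lab <- srg(img, seeds)
lab
```

The grown spot region follows the shape of the bright pixels rather than a circle, which is the point of adaptive shape methods. The sketch assumes at least one seed and a connected image.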
Seeds
Histogram Segmentation
• Older and Simpler
• Plots the histogram of all the pixels in the area containing
the Spot and its background
• Ideally the histogram is bimodal, with the higher mode
corresponding to the spot pixels and the lower mode to the background
• Does not make use of the spatial nature of the data.
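A minimal sketch of the histogram idea in R, on synthetic pixel values. The threshold rule used here (iterating the midpoint of the two class means) is one simple choice picked for illustration; it is an assumption and not necessarily what ImaGene, QuantArray or DeArray do.

```r
# Synthetic pixel values for one target area (made-up numbers).
set.seed(2)
background <- rnorm(300, mean = 100, sd = 10)
spot <- rnorm(100, mean = 1000, sd = 80)
pixels <- c(background, spot)   # spatial positions are deliberately ignored

# Iterative midpoint threshold between the two modes.
thr <- mean(pixels)
repeat {
  new_thr <- (mean(pixels[pixels > thr]) + mean(pixels[pixels <= thr])) / 2
  if (abs(new_thr - thr) < 1e-6) break
  thr <- new_thr
}

foreground <- pixels > thr
c(threshold = thr,
  spot = median(pixels[foreground]),
  background = median(pixels[!foreground]))
```

Only the pixel values enter the calculation, so a bright dust speck far from the spot would be classified as foreground just as readily as a true spot pixel.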
Quantification
• Having decided which pixels belong to each spot and the
background intensity, suitable statistics are calculated for
each spot.
• The “spot intensity” is generally the mean or median
(GenePix, ScanAlyze, QuantArray, Spot) of the pixel
intensities for that spot.
• Theory states that the level of fluorescence is directly
proportional to the amount of RNA
• Local background is calculated using median pixel
intensity since the mean can be distorted by outliers.
Local background
• Focusing on small regions surrounding the spot mask.
• Median of pixel values in this region
• Most software packages implement such an approach
(ScanAlyze, ImaGene, Spot, GenePix)
• By not considering the pixels immediately surrounding the
spots, the background estimate is less sensitive to the
performance of the segmentation procedure
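The quantification and local background steps can be sketched together in R. Everything below is synthetic: the patch, the spot mask, the annulus radii (9 to 14 pixels) and the dust pixel are assumptions chosen only to show why the median is preferred for the background.

```r
# Synthetic 31 x 31 patch with a spot of radius 6 at the centre.
set.seed(3)
n <- 31
patch <- matrix(rnorm(n * n, mean = 100, sd = 10), n, n)
d <- sqrt((row(patch) - 16)^2 + (col(patch) - 16)^2)  # distance from spot centre
patch[d <= 6] <- patch[d <= 6] + 900
patch[16, 27] <- 60000          # one bright dust pixel in the background region

spot_mask <- d <= 6             # would come from the segmentation step
bg_mask <- d > 9 & d <= 14      # annulus that skips pixels right next to the spot

foreground <- median(patch[spot_mask])
bg_median <- median(patch[bg_mask])
bg_mean <- mean(patch[bg_mask])

c(foreground = foreground, bg_median = bg_median, bg_mean = bg_mean,
  corrected = foreground - bg_median)
```

The single dust pixel pulls the mean background far above 100, while the median barely moves. Leaving a gap between the spot mask and the annulus keeps stray spot pixels out of the background estimate even if the segmentation is slightly off.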
Local Backgrounds
Information Extraction
— Spot Intensities
—mean (pixel intensities).
—median (pixel intensities).
— Background values
—Local
—Morphological opening
—Constant (global)
—None
Take the average
Quality Measurements
• Array
– Correlation between spot intensities.
– Percentage of spots with no signals.
– Distribution of spot signal area.
• Spot
– Signal / Noise ratio.
– Variation in pixel intensities.
– Identification of “bad spot” (spots with no signal).
• Ratio (2 spots combined)
– Circularity
Example of what the data would look like
2 channel cDNA:
Num | Array Row | Array Col | Row | Col | Name | X Locn | Y Locn | ch1 Intensity | ch1 Backgrd | ch1 Intensity Std Dev | ch1 Backgrd Std Dev | ch2 Intensity | ch2 Backgrd | ch2 Intensity Std Dev | ch2 Backgrd Std Dev | Filter
1 | 1 | 1 | 1 | 1 |  | 2660 | 7060 | 2259.5 | 140.2 | 1309.4 | 100.1 | 6782.3 | 220.0 | 3804.6 | 5.7 | 1
2 | 1 | 1 | 1 | 2 |  | 2910 | 7070 | 555.6 | 123.4 | 464.0 | 16.8 | 2067.3 | 400.0 | 1439.8 | 293.2 | 1
3 | 1 | 1 | 1 | 3 |  | 3180 | 7060 | 1488.2 | 167.6 | 981.4 | 567.6 | 3845.7 | 345.0 | 2150.7 | 745.6 | 1
4 | 1 | 1 | 1 | 4 |  | 3450 | 7060 | 1140.1 | 921.3 | 752.1 | 34.2 | 2837.2 | 553.4 | 1627.6 | 158.9 | 1
5 | 1 | 1 | 1 | 5 |  | 3680 | 7050 | 2106.0 | 555.2 | 1369.9 | 19.9 | 5990.6 | 518.0 | 3721.4 | 653.1 | 1
The AFFY Chip
Affymetrix GeneChips®
Probes = 25 bp sequences
Probe Sets = set of probes corresponding to a particular
gene or EST.
In the past there have been 20 probes/probe set on
human chips, 16 on mouse, while there are 11 on
Human GeneChips® HG-U133A.
Most genes or ESTs contain one probe set, but quite a
few have > 1.
DATA AND NOTATION
• $PM_{ijg}$, $MM_{ijg}$: intensity for the perfect match
and mismatch probe in cell j for gene g on
chip i
– i = 1…n: from one to hundreds of chips
– j = 1…J: from 16-20 probe pairs
– g = 1…G: from 8000-35000 probe sets
• Compute SIGNAL/ Expression measure
[Figure: layout of probe set 1 and probe set 2, with a probe cell and a probe pair marked.]
Expression Value Calculation (Signal)
• The signal represents the amount of transcript in solution
• Signal is calculated as follows (in brief):
- Cell intensities are preprocessed for global background
- An ideal mismatch value is calculated and subtracted to adjust PM intensity
- The adjusted PM intensities are log transformed to stabilize the variance
- The Tukey’s biweight estimator is used to provide a robust mean of the signal
- Signal is output as the antilog of the mean signal value
- Finally the signal is scaled to generate normalized data
DETECTION ALGORITHM
• BACKGROUND: Average of the lowest 2% of the
intensities subtracted.
Algorithm
• First calculate R: ability to detect intended target
for each probe
R = (PM-MM)/(PM+MM)
– R near 1 means PM>>MM
– R near or below 0 means PM <= MM
• Define: t (default 0.015) as cutoff for R to be
“present” for each probe pair
Algorithm contd…
• Calculate (R-t) for each Probe.
• Rank Probes according to their (R-t) Values
• Apply Wilcoxon's signed-rank test (nonparametric) to generate the detection p-value.
Wilcoxon Signed Rank Test
• To test whether the median $\theta$ of a distribution is >, <, or ≠ a hypothesized value $\theta_0$.
• Non-parametric equivalent of the one-sample mean problem.
• Model: $y_i = \theta + e_i$
• Procedure for the "greater than" alternative:
– Subtract $\theta_0$ from the $y_i$: $z_i = y_i - \theta_0$
– Calculate the absolute values $|z_i|$ and
define $\psi_i = 1$ if $z_i$ is positive and 0 otherwise
– Rank the absolute values, $R_i$
– Test statistic: the sum of positive ranks, $S = \sum_{i=1}^{n} R_i \psi_i$
– Find the corresponding p-value $P(S \ge s)$ and reject if the p-value is
small.
Calculating p-values
• Logic: let's find the distribution of the test statistic.
– If there are n observations, the total number of possible
sign configurations for the ranks is $2^n$
– For n = 8, there are $2^8 = 256$ possible outcomes:
• All positive: S = 1+2+…+8 = 36
• One negative (8 options): S = 35,…,28
• Two negative (28 options): S = 33,…,21
• And so on
• All we need are the extreme outcomes, to see how extreme
our test statistic is.
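The counting argument can be checked by brute force in R. The sketch below enumerates all sign patterns for n = 8 (assuming no tied ranks) and compares the tail probability with R's built-in signed-rank distribution function `psignrank`.

```r
n <- 8
# All 2^n sign patterns: each row marks which of the ranks 1..n are positive.
signs <- as.matrix(expand.grid(rep(list(0:1), n)))
S <- as.vector(signs %*% (1:n))   # sum of positive ranks for each pattern

nrow(signs)        # 256 possible outcomes
max(S)             # 36, all positive
mean(S >= 35)      # 2/256 = 0.0078125
mean(S > 35)       # 1/256 = 0.00390625

# Two negatives: S ranges from 36 - (7 + 8) = 21 to 36 - (1 + 2) = 33.
two_neg <- rowSums(signs) == n - 2
range(S[two_neg])

# Same upper tail from the built-in distribution: P(S > 34) = P(S >= 35).
psignrank(34, n, lower.tail = FALSE)
```

Under the null hypothesis every sign pattern is equally likely, so the p-value is just the fraction of patterns at least as extreme as the observed S.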
Discrimination Score [R]
[Figure: discrimination score R versus MM intensity per probe pair, with the threshold τ marked on the R axis.]
Increasing τ reduces false positives but also reduces the number of present calls.
Detection Call
• Detection Call is based on p-value cut offs:
Alpha1 and Alpha2 provide boundaries for P,M,A
calls
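• Default: $\alpha_1 = 0.04$, $\alpha_2 = 0.06$
• $p < \alpha_1$: P (present); $p > \alpha_2$: A (absent); intermediate: M (marginal)
[Figure: p-value axis from 0.00 to 1.00, split at $\alpha_1 = 0.04$ and $\alpha_2 = 0.06$ into the P, M and A regions.]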
Example (τ = 0.015):

PM        MM        R        z = R − 0.015   |z|     Rank   Positive?
61215.0   238.3     0.992     0.977          0.977   5.5    1
39000.8   40252.0   -0.02    -0.035          0.035   1      0
61246.0   239.0     0.992     0.977          0.977   5.5    1
60345.0   286.0     0.991     0.976          0.976   4      1
59293.0   190.8     0.994     0.979          0.979   8      1
54310.5   6314.0    0.792     0.777          0.777   2      1
50324.8   265.0     0.990     0.975          0.975   3      1
62199.3   218.0     0.993     0.978          0.978   7      1

s = 35 (sum of the positive ranks)
P(S ≥ 35) = 2/256 ≈ 0.008 (and P(S > 35) = 1/256 ≈ 0.004), which is below α1 = 0.04
Detection call: PRESENT
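The whole detection call for this probe set can be reproduced in a few lines of R. This is a sketch under stated assumptions: the first MM value is taken as 238.3 (the value used in the later examples), the p-value is the exact signed-rank tail from `psignrank`, which assumes no ties, and the defaults τ = 0.015, α1 = 0.04, α2 = 0.06 are the ones quoted on the slides.

```r
PM <- c(61215.0, 39000.8, 61246.0, 60345.0, 59293.0, 54310.5, 50324.8, 62199.3)
MM <- c(238.3, 40252.0, 239.0, 286.0, 190.8, 6314.0, 265.0, 218.0)
tau <- 0.015
alpha1 <- 0.04
alpha2 <- 0.06

R <- (PM - MM) / (PM + MM)   # discrimination score per probe pair
z <- R - tau
ranks <- rank(abs(z))        # ranks of |z|
S <- sum(ranks[z > 0])       # sum of positive ranks

# Exact upper tail P(S >= s) under the null.
p <- psignrank(S - 1, length(z), lower.tail = FALSE)

detection <- if (p < alpha1) "P" else if (p > alpha2) "A" else "M"
c(S = S, p = p)
detection
```

With unrounded scores the two probe pairs shown as 0.992 are not exactly tied, so they receive ranks 5 and 6 instead of 5.5 each; the sum of positive ranks is 35 either way.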
SIGNAL
• Calculated using One –step Tukey’s Biweight estimate:
robust weighted mean, insensitive to outliers
• One STEP Tukey’s Biweight Algorithm
– Let the data be $x_i$, and let $m$ be the median of the data.
– Calculate the Median Absolute Deviation:
– $\mathrm{MAD} = \mathrm{median}_i\,|x_i - m|$
– $u_i = (x_i - m)/(c \cdot \mathrm{MAD} + \epsilon)$
– Here $c$ is the tuning constant (set at 5) and $\epsilon = 0.0001$ (so that
we don't have division by zero)
Tukey Bi-weight
• Weights are calculated as:
$$w(u_i) = \begin{cases} (1 - u_i^2)^2 & \text{when } |u_i| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
• Tukey-Biweight is:
$$T_{bi} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Comments on Biweight
• Generally used for multiple iterations, but here we are
using just one iteration of it.
• Supposedly very robust as an estimator.
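The one-step estimator is short enough to write as a single R function. The sketch below follows the definitions on the slides (c = 5, ε = 0.0001); the function name `tukey_biweight` and the toy data are assumptions for illustration.

```r
# One-step Tukey biweight: a robust weighted mean.
tukey_biweight <- function(x, tune = 5, eps = 1e-4) {
  m <- median(x)
  mad_x <- median(abs(x - m))           # raw MAD, no scaling constant
  u <- (x - m) / (tune * mad_x + eps)   # standardized distance from the median
  w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)
  sum(w * x) / sum(w)
}

# Toy data with one gross outlier.
x <- c(10.1, 9.8, 10.3, 9.9, 10.0, 25.0)
mean(x)             # pulled upward by the outlier
tukey_biweight(x)   # stays close to 10
```

Points more than c · MAD away from the median get weight zero, so the outlier at 25 does not contribute at all, while the remaining points are weighted smoothly by their distance from the median.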
Signal Calculation
• Signal = Tukey Biweight{$\log_2(PM_j - IM_j)$}
• IM = Idealized Mismatch, which is never greater than PM
• If MM < PM, then IM = MM
• If MM ≥ PM, then IM is computed from the Specific Background (SB)
• Calculate SB (Specific Background):
• $SB = T_{bi}(\log_2(PM_j) - \log_2(MM_j))$
Signal Calculation
• What is the Idealized Mismatch (IM)?
• According to Affymetrix
• the reason for including the MM probe is to provide a
value that comprises most of the cross-hybridization
and stray signal affecting the PM probe.
• It does contain a portion of the true signal. If MM is less
than PM then it can be directly used.
• If not we calculate the IM. To do so, first calculate the
Specific Background (SB) for each probe pair in a probe
set:
Specific Background Calculations
• Calculate $y = \log_2(PM/MM)$
• Find median(y) = m
• Calculate MAD = median|y − m|
• Define $u = (y - m)/(c \cdot \mathrm{MAD} + \epsilon)$ with c = 5 and ε = 0.0001
• Define $w = (1 - u^2)^2$ if $|u| \le 1$, 0 otherwise
• $T_b(y) = \sum_i y_i w_i / \sum_i w_i$
Example: Specific Background calculations
PM      MM      y = log2(PM/MM)   y − m     |y − m|     u        |u|     w       |u| ≤ 1   w·y
61215   238.3    8.0049624         0.1437   0.1436942    0.098   0.098   0.981   1         7.852
60345   286      7.7210753        -0.1402   0.140193    -0.095   0.095   0.982   1         7.581
59293   190.8    8.2796568         0.4184   0.4183886    0.285   0.285   0.844   1         6.990
54311   6314     3.104605         -4.7567   4.7566633   -3.239   3.239   0       0         0
50325   265      7.5691334        -0.2921   0.2921349   -0.199   0.199   0.922   1         6.982
62199   218      8.1564264         0.2952   0.2951582    0.201   0.201   0.921   1         7.511
39000   40252   -0.0455863        -7.9069   7.9068546   -5.385   5.385   0       0         0
61246   239      8.0014612         0.1402   0.140193     0.095   0.095   0.982   1         7.856

Median of y: m = 7.8612682; MAD = 0.2936465
SB = Tb(y) = Σ w·y / Σ w = 7.949
Signal Calculation
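• First we need to define the Idealized Mismatch:
$$IM = \begin{cases} MM, & MM < PM \\ \dfrac{PM}{2^{SB}}, & MM \ge PM,\ SB > \tau_c \\ \dfrac{PM}{2^{\tau_c / \left(1 + \frac{\tau_c - SB}{\tau_s}\right)}}, & MM \ge PM,\ SB \le \tau_c \end{cases}$$
By default the contrast $\tau_c = 0.03$ and the scale $\tau_s = 10$.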
Signal Calculation Contd
• Vij=Max(PMij-IMij, d)
• The d is a small positive constant.
• Signal=Tbi(log2(Vij))
• Keep in mind here we are SUBTRACTING IM from PM
and not taking a ratio as we did for SB.
Example for Signal Calculation
PM        MM      IM       x = log2(PM − IM)   x − med        |x − med|   u              w          w·x
61215     238.3   238.3    15.89597             0.033461931   0.033462     0.146673574   0.957437   15.21938
60345     286     286      15.87409             0.01158431    0.011584     0.050777468   0.99485    15.79234
59293     190.8   190.8    15.85092            -0.01158431    0.011584    -0.050777468   0.99485    15.76929
54310.5   6314    6314     15.55064            -0.311866938   0.311867    -1.367005326   0          0
50324.8   265     265      15.61136            -0.251143617   0.251144    -1.10083699    0          0
62199.3   218     218      15.91955             0.057036871   0.057037     0.250009528   0.878897   13.99165
39000     40252   158.27   15.24532            -0.617188684   0.617189    -2.705321133   0          0
61246     239     239      15.89669             0.034178645   0.034179     0.14981514    0.955615   15.19111

Median of x = 15.86251; MAD = 0.045608
Tbi = Σ w·x / Σ w = 15.88652; 2^15.88652 = 60578.7
Signal=58611.65, P
Program for Signal Calculation for PM and MM data.
Have the data in a csv file called signal1.csv
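setwd("/myRfolder")
hwdata <- read.table("signal1.csv", header=TRUE, sep=",", na.strings=" ")
# SB (specific background): one-step Tukey biweight of log2(PM/MM)
p1 <- hwdata$PM
m1 <- hwdata$MM
y <- log2(p1/m1)
md <- median(y)                    # median of the log ratios
z <- abs(y - md)
mz <- median(z)                    # median absolute deviation (MAD)
u <- (y - md)/(5*mz + .0001)       # c = 5, e = .0001
w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)
sb <- sum(y*w)/sum(w)
sb
# signal calculation begins
# Idealized mismatch: MM if MM < PM, otherwise derived from SB
ct <- 0.03                         # contrast tau (default)
st <- 10                           # scale tau (default)
im_sb <- if (sb > ct) p1/(2^sb) else p1/(2^(ct/(1 + (ct - sb)/st)))
IM <- ifelse(m1 < p1, m1, im_sb)
# V = max(PM - IM, d); d = 2^-20 is an assumed small positive constant
ys <- log2(pmax(p1 - IM, 2^-20))
ms <- median(ys)
zs <- abs(ys - ms)
ss <- median(zs)
us <- (ys - ms)/(5*ss + .0001)
ws <- ifelse(abs(us) <= 1, (1 - us^2)^2, 0)
tbs <- sum(ws*ys)/sum(ws)          # one-step Tukey biweight of log2(V)
sgnl <- 2^tbs                      # signal on the original scale
sgnl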
Idealized Mismatch
AFFY: SINGLE CHIP
EXPRESSION
AFFY: TREATMENT
COMPARISON: TWO CHIPS