Lecture 3: From Images to Data

A lot of this is not so relevant now, but I think it's good basic knowledge.

Recap: How Microarrays Are Used to Measure Gene Expression
• Basic idea: measure the activity level of a gene (its expression level) in a particular cell at a particular time by measuring the concentration of that gene's mRNA transcript in the cell's total RNA.
• Immobilize DNA probes (oligos, cDNA) onto glass.
• Hybridize labelled target mRNA (in reality its cDNA equivalent) with the probes on the glass.
• Measure how much binds to each probe (i.e. forms double-stranded DNA). (In two-channel arrays, equal amounts of two differently labelled target cDNAs are hybridized to the probes.)
• Recall: the dyes are chosen to have well-separated spectra. Cy3 is excited with a 532 nm laser and Cy5 with a 635 nm laser; their emission peaks lie at roughly 570 nm and 670 nm.

What Is Measured in Microarrays?
• We measure the amount of labelled target cDNA bound to each immobilized probe by exciting the labelling molecules (the dye) with a laser, then collecting and counting the photons emitted.
• In practice the chip is scanned (excitation, emission, counting of emitted photons), and the result is a digital image. This image then needs to be processed to locate the probes and assign an intensity measurement to each of them. There can be from hundreds to millions of different probes on each slide/chip.
From Slide to Image
• Uses a confocal microscope (scanner) or a charge-coupled device (CCD).
• The scanned region is divided into equally sized pixels.
• A laser generates excitation light, which is focused on a small portion of the microscope slide.
• Fluorescent molecules in that area absorb the excitation photons and emit fluorescence photons.
• These fluorescence photons are WHAT we want to measure.
• They are gathered by a lens.
• We don't want the excitation photons (they distort the results).
• Use a dichroic beam splitter and a band-pass filter to keep only the fluorescence photons.
• The detector (photomultiplier tube, PMT) converts the emitted photons into an electric current.
• An analog-to-digital (A/D) converter turns that current into the pixel values of a digital image.

Scanner Process
[Figure: scanner process. Components: laser, dye, PMT, A/D converter. Labels: excitation, photons, electrons, amplification, filtering, time-space averaging, signal.]

Possible Noise
• Source noise and detector noise
• Source noise:
  – Excitation photons, dust on slides, treatment of the glass slide
• Detector noise:
  – Amplification, digitization
The perfect image would only reflect measurements related to the particular dye, but in practice images are commonly a combination of signal with photon noise, electronic noise, laser light reflection and background fluorescence.

Adjustments to Minimize Noise
• One can adjust (depending on the scanner):
  – Scan rate, laser power, PMT voltage
• In general, higher laser power gives more signal, but also more noise.
• In general, higher PMT voltage gives more signal per photon, but may saturate pixels.

[Figure: example of the TIFF image for a malaria data set; we will look at the numbers later.]

From Image to Numbers
• TIFF images are processed by the image analysis program to give intensity values for each probe.
• Pixel intensities typically range from 0 to 2^16 - 1 = 65535 (a 16-bit scale).
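As a quick check of the 16-bit scale and of the saturation issue mentioned under PMT voltage, here is a minimal R sketch. The vector `pixels` and the helper name `saturated_fraction` are made up for illustration; they are not output from any scanner or image analysis package.

```r
# Largest value on a 16-bit scale: 2^16 - 1
max_intensity <- 2^16 - 1

# Fraction of pixels sitting at the top of the scale (i.e. saturated).
saturated_fraction <- function(pixels, max_value = max_intensity) {
  mean(pixels >= max_value)
}

# Hypothetical pixel intensities for one spot
pixels <- c(120, 4300, 65535, 65535, 980, 15000, 65535, 210)

max_intensity               # 65535
saturated_fraction(pixels)  # 3 of the 8 pixels are saturated
```

A spot with a large saturated fraction has had its true intensity clipped, which is one reason to lower the PMT voltage and rescan.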
IMAGE ACQUISITION
  – Addressing / Spot Recognition / Gridding: locating each spot (computational)
  – Segmentation: classifying each pixel as foreground (spot) or background
  – Data Extraction: calculating intensity, background and ratio for each spot

Images: Examples
Pseudo-color overlay (Cy3 and Cy5)
Spot color   Signal strength       Gene expression
yellow       Control = perturbed   unchanged
red          Control < perturbed   induced
green        Control > perturbed   repressed

Example: Addressing
This is the process of assigning coordinates to each of the spots. The basic structure is KNOWN, i.e. it is known beforehand how many grids there are and how many rows and columns each grid has. In the example: 4 by 4 grids, 19 by 21 spots per grid.

Gridding/Addressing
• Determination of the center of each spot.
• Hard, since the spots are SO small; however, the fact that the grid is known helps a lot.
• Generally, a fixed grid is placed over the array and semi-manual adjustments are made (mentioned before).
• The result is a partition of the array into areas, each containing a spot and its background.

Gridding/Addressing
• Addressing is matching the idealized model to the scanned image.
• Parameters include:
  – Separation between rows and columns of grids
  – Individual translations of grids
  – Separation between rows and columns of spots within each grid
  – Small individual translations of spots
  – Overall position of the array in the image

Addressing
Within the same batch of print runs:
• Estimate the translation of grids (4 by 4 grids)
• Other problems:
  – Mis-registration
  – Rotation
  – Skew in the array

Addressing: Registration
[Figure: registration examples.]

Segmentation
• Segmentation methods:
  – Fixed circle segmentation
  – Adaptive circle segmentation
  – Adaptive shape segmentation
  – Histogram segmentation

Method            Software / algorithms
Fixed circle      ScanAlyze, GenePix, QuantArray
Adaptive circle   GenePix, Dapple
Adaptive shape    Spot; region growing and watershed
Histogram         ImaGene, QuantArray, DeArray; adaptive thresholding

Fixed Circle Segmentation
• Fits a circle with a constant diameter to all spots in the image.
• Easy to implement.
• Only works well if the spots all have the same shape and size.
[Figure: a bad example of fixed circle segmentation.]

Adaptive Circle Segmentation
• The circle diameter is estimated separately for each spot.
• Dapple finds spots by detecting the edges of spots.
• Problematic if spots are oval, which is often the case.

Limitations of Circular Segmentation
  – Small spots
  – Spots that are not circular
[Figure: results from SRG.]

Adaptive Shape Segmentation
• Uses the WATERSHED algorithm (Beucher and Meyer, 1993, Mathematical Morphology in Image Processing, Chapter 12) and
• Seeded Region Growing, SRG (Adams and Bischof, 1994, "Seeded region growing", IEEE Transactions on Pattern Analysis and Machine Intelligence).
• Requires the specification of starting points, or seeds.
• Start with a seed and grow the shape: regions grow outwards from the seed points, preferentially according to the difference between a pixel's value and the running mean of values in an adjoining region.
• Well suited to microarrays: since the array layout is known, choosing the location of the seeds is not difficult (Spot).
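To make the region-growing rule concrete, here is a deliberately naive R sketch on a toy image. It is not the implementation used by Spot or the priority-queue algorithm of Adams and Bischof; the function name `seeded_region_growing`, the 4-neighbour connectivity and the toy image are all assumptions for illustration. At each step it labels the one unlabelled pixel whose value is closest to the running mean of an adjoining region.

```r
# img:   numeric matrix of pixel intensities
# seeds: matrix of the same size, 0 = unlabelled, 1, 2, ... = seed labels
seeded_region_growing <- function(img, seeds) {
  labels <- seeds
  nr <- nrow(img)
  nc <- ncol(img)
  steps <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))  # 4-neighbourhood
  repeat {
    best_delta <- Inf
    best_idx <- NA
    best_lab <- NA
    for (idx in which(labels == 0)) {
      row_i <- (idx - 1) %% nr + 1
      col_i <- (idx - 1) %/% nr + 1
      for (k in 1:4) {
        rr <- row_i + steps[k, 1]
        cc <- col_i + steps[k, 2]
        if (rr < 1 || rr > nr || cc < 1 || cc > nc) next
        lab <- labels[rr, cc]
        if (lab == 0) next
        # distance between the pixel and the running mean of that region
        delta <- abs(img[idx] - mean(img[labels == lab]))
        if (delta < best_delta) {
          best_delta <- delta
          best_idx <- idx
          best_lab <- lab
        }
      }
    }
    if (is.na(best_idx)) break       # nothing left to label
    labels[best_idx] <- best_lab     # grow the closest-matching region
  }
  labels
}

# Toy 5 x 5 image: a bright 3 x 3 spot on a dim background
img <- matrix(100, 5, 5)
img[2:4, 2:4] <- 1000
seeds <- matrix(0, 5, 5)
seeds[1, 1] <- 1   # background seed
seeds[3, 3] <- 2   # spot seed (known from the array layout)

spot_mask <- seeded_region_growing(img, seeds) == 2
sum(spot_mask)     # number of pixels assigned to the spot
```

The rescanning of all unlabelled pixels at every step keeps the code short but makes it slow on real images.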
[Figure: seeds.]

Histogram Segmentation
• Older and simpler.
• Plots the histogram of all the pixels in the area containing the spot and its background.
• Ideally the histogram is bimodal, with the higher mode corresponding to the spot pixels and the lower mode to the background.
• Does not make use of the spatial nature of the data.

Quantification
• Having decided which pixels belong to each spot and which to the background, suitable statistics are calculated for each spot.
• The "spot intensity" is generally the mean or median (GenePix, ScanAlyze, QuantArray, Spot) of the pixel intensities for that spot.
• Theory states that the level of fluorescence is directly proportional to the amount of RNA.
• Local background is calculated using the median pixel intensity, since the mean can be distorted by outliers.

Local Background
• Focuses on small regions surrounding the spot mask.
• Uses the median of the pixel values in this region.
• Most software packages implement such an approach (ScanAlyze, ImaGene, Spot, GenePix).
• By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure.
[Figure: local backgrounds.]

Information Extraction
• Spot intensities
  – mean (pixel intensities)
  – median (pixel intensities)
• Background values
  – Local
  – Morphological opening
  – Constant (global)
  – None
Take the average.

Quality Measurements
• Array
  – Correlation between spot intensities.
  – Percentage of spots with no signal.
  – Distribution of spot signal area.
• Spot
  – Signal/noise ratio.
  – Variation in pixel intensities.
  – Identification of "bad spots" (spots with no signal).
• Ratio (2 spots combined)
  – Circularity

Example of What the Data Would Look Like (2-channel cDNA)
The Name column is blank in this excerpt and is omitted.

Spot position:
Num  Array Row  Array Col  Row  Col  X Locn  Y Locn  Filter
1    1          1          1    1    2660    7060    1
2    1          1          1    2    2910    7070    1
3    1          1          1    3    3180    7060    1
4    1          1          1    4    3450    7060    1
5    1          1          1    5    3680    7050    1

Channel measurements (Int = Intensity, Bkg = Backgrd, SD = Std Dev):
Num  ch1 Int  ch1 Bkg  ch1 Int SD  ch1 Bkg SD  ch2 Int  ch2 Bkg  ch2 Int SD  ch2 Bkg SD
1    2259.5   140.2    1309.4      100.1       6782.3   220.0    3804.6      5.7
2    555.6    123.4    464.0       16.8        2067.3   400.0    1439.8      293.2
3    1488.2   167.6    981.4       567.6       3845.7   345.0    2150.7      745.6
4    1140.1   921.3    752.1       34.2        2837.2   553.4    1627.6      158.9
5    2106.0   555.2    1369.9      19.9        5990.6   518.0    3721.4      653.1

The AFFY Chip: Affymetrix GeneChips®
• Probes = 25 bp sequences.
• Probe sets = sets of probes corresponding to a particular gene or EST.
• In the past there were 20 probes per probe set on human chips and 16 on mouse chips, while there are 11 on the Human GeneChips® HG-U133A.
• Most genes or ESTs are represented by one probe set, but quite a few have more than one.

DATA AND NOTATION
• PMijg, MMijg: intensity of the perfect match and mismatch probe in cell j for gene g on chip i
  – i = 1…n: from one to hundreds of chips
  – j = 1…J: from 16-20 probe pairs
  – g = 1…G: from 8000-35000 probe sets
• Compute the SIGNAL / expression measure.
[Figure: data and notation. Labels: probe set 1, probe set 2, probe cell, probe pair.]

Expression Value Calculation (Signal)
• The signal represents the amount of transcript in solution.
• Signal is calculated as follows (in brief):
  – Cell intensities are preprocessed for global background.
  – An ideal mismatch value is calculated and subtracted to adjust the PM intensity.
  – The adjusted PM intensities are log transformed to stabilize the variance.
  – Tukey's biweight estimator is used to provide a robust mean of the signal.
  – Signal is output as the antilog of the mean signal value.
  – Finally, the signal is scaled to generate normalized data.

DETECTION ALGORITHM
• BACKGROUND: the average of the lowest 2% of the intensities is subtracted.
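The background bullet can be sketched in a few lines of R. This is a simplified, whole-chip version written only to illustrate the "lowest 2%" idea; the Affymetrix software works zone by zone with smoothing, which is not reproduced here. The function name `subtract_low_background` and the input vector `cells` are hypothetical.

```r
# Estimate background as the mean of the dimmest `fraction` of cells
# and subtract it from every cell intensity.
subtract_low_background <- function(intensity, fraction = 0.02) {
  n_low <- max(1, round(fraction * length(intensity)))
  background <- mean(sort(intensity)[seq_len(n_low)])
  list(background = background, corrected = intensity - background)
}

# Hypothetical cell intensities: 10, 20, ..., 1000 (100 cells)
cells <- seq(10, 1000, by = 10)
res <- subtract_low_background(cells)
res$background        # mean of the two dimmest cells
range(res$corrected)  # the dimmest cells end up at or below zero
```

Note that the dimmest cells become zero or negative after subtraction, so a real implementation needs some floor before taking logs.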
Algorithm
• First calculate the discrimination score R, the ability of each probe pair to detect its intended target: R = (PM - MM)/(PM + MM)
  – R near 1 means PM >> MM
  – R near or below 0 means PM <= MM
• Define τ (default 0.015) as the cutoff that R must exceed for a probe pair to count as "present".

Algorithm contd…
• Calculate (R - τ) for each probe pair.
• Rank the probe pairs according to the absolute value of (R - τ).
• Apply Wilcoxon's signed rank test (nonparametric) to generate the detection p-value.

Wilcoxon Signed Rank Test
• Tests whether the median θ of a distribution is >, < or ≠ a hypothesized value θ0.
• Nonparametric equivalent of the one-sample mean problem.
• Model: y_i = θ + e_i
• Procedure for the "greater than θ0" alternative:
  – Subtract θ0 from each y_i: z_i = y_i - θ0
  – Calculate the absolute values |z_i|
  – Define ψ_i = 1 if z_i is positive and 0 otherwise
  – Rank the absolute values; call the ranks R_i
  – Test statistic: the sum of the positive ranks, $S = \sum_i R_i \psi_i$
  – Find the corresponding p-value P(S ≥ s), where s is the observed value, and reject if the p-value is small.

Calculating p-values
• Logic: let's find the distribution of the test statistic under the null hypothesis.
  – If there are n observations, each rank can carry a positive or a negative sign, so the total number of possible (equally likely) configurations is 2^n.
  – For n = 8, there are 256 possible outcomes:
    • All positive: S = 1 + 2 + … + 8 = 36
    • One negative (8 options): S = 35, …, 28
    • Two negative (28 options): S = 33, …, 21
    • And so on
• All we need are the extreme outcomes, to see how extreme our test statistic is.
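The counting argument can be checked by brute force. The R sketch below enumerates all 2^n sign configurations and reports the share with a rank sum at least as large as the one observed. The function name `signed_rank_exact` is made up for this illustration, and the eight input values are (R - τ) values of the kind used in this lecture's detection example; this is a teaching sketch, not the vendor's implementation.

```r
# Exact one-sided p-value for the signed rank statistic by enumeration.
# z: the values y_i - theta0 (here, R - tau for each probe pair)
signed_rank_exact <- function(z) {
  n <- length(z)
  r <- rank(abs(z))                 # tied values get the average rank
  s_obs <- sum(r[z > 0])            # sum of the positive ranks
  # every row is one of the 2^n possible 0/1 sign patterns
  signs <- as.matrix(expand.grid(rep(list(c(0, 1)), n)))
  s_all <- as.vector(signs %*% r)   # S under each pattern
  list(S = s_obs, n_outcomes = length(s_all),
       p_value = mean(s_all >= s_obs))
}

z <- c(0.977, -0.035, 0.977, 0.976, 0.979, 0.777, 0.975, 0.978)
res <- signed_rank_exact(z)
res$S           # 35
res$n_outcomes  # 256
res$p_value     # 2/256: the outcomes S = 36 and S = 35
```

Enumeration is only practical for small n (the matrix has 2^n rows), which is fine for probe sets of 11 to 20 pairs.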
[Figure: Discrimination Score [R] against MM intensity per probe pair, with the threshold τ marked.]
• Increasing τ reduces false positives, but also reduces the number of present calls.

Detection Call
• The detection call is based on p-value cutoffs: α1 and α2 provide the boundaries for the P (present), M (marginal) and A (absent) calls.
• Defaults: α1 = 0.04, α2 = 0.06
• p < α1: P; p > α2: A; in between: M
• On the p-value scale: 0.00 to 0.04 is P, 0.04 to 0.06 is M, 0.06 to 1.00 is A.

Example:
PM        MM        R       Zi = R - 0.015   |Zi|    Rank   Positive
61215.0   238.3     0.992    0.977           0.977   5.5    1
39000.8   40252.0   -0.02   -0.035           0.035   1      0
61246.0   239.0     0.992    0.977           0.977   5.5    1
60345.0   286.0     0.991    0.976           0.976   4      1
59293.0   190.8     0.994    0.979           0.979   8      1
54310.5   6314.0    0.792    0.777           0.777   2      1
50324.8   265.0     0.990    0.975           0.975   3      1
62199.3   218.0     0.993    0.978           0.978   7      1

S = 35 (all ranks except rank 1 are positive). Only two of the 256 sign configurations give S ≥ 35 (S = 36 and S = 35), so P(S ≥ 35) = 2/256 ≈ 0.008 < 0.04: PRESENT.

SIGNAL
• Calculated using the one-step Tukey's biweight estimate: a robust weighted mean, insensitive to outliers.
• One-step Tukey's biweight algorithm:
  – Let the data be x_i, and let m be the median of the data.
  – Calculate the median absolute deviation: MAD = median |x_i - m|
  – u_i = (x_i - m)/(c·MAD + ε)
  – Here c is the tuning constant (set at 5) and ε = 0.0001 (so that we don't divide by zero).

Tukey Biweight
• The weights are calculated as:
$$w(u_i) = \begin{cases} (1 - u_i^2)^2 & \text{when } |u_i| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
• The Tukey biweight is:
$$T_{bi} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Comments on Biweight
• Generally used with multiple iterations, but here we use just one iteration.
• Supposedly very robust as an estimator.

Signal Calculation
• Signal = Tukey biweight{log2(PM_j - IM_j)}
• IM = idealized mismatch, which is never greater than PM.
• If MM < PM, then IM = MM.
• If MM ≥ PM, then an IM value derived from PM is used instead of MM.
• To do this, calculate SB (specific background):
• SB = Tbi(log2(PM) - log2(MM))

Signal Calculation
• What is the idealized mismatch (IM)?
• According to Affymetrix, the reason for including the MM probe is to provide a value that comprises most of the cross-hybridization and stray signal affecting the PM probe.
• It also contains a portion of the true signal. If MM is less than PM, then it can be used directly.
• If not, we calculate the IM. To do so, first calculate the specific background (SB) for each probe set from its probe pairs:

Specific Background Calculations
• Calculate y = log2(PM/MM) for each probe pair.
• Find m = median(y).
• Calculate MAD = median |y - m|.
• Define u = (y - m)/(c·MAD + ε), with c = 5 and ε = 0.0001.
• Define w = (1 - u^2)^2 if |u| ≤ 1, and 0 otherwise.
• Tb(y) = Σ y_i w_i / Σ w_i

Example: Specific Background Calculations
PM      MM      y = log2(PM/MM)   y - m     |y - m|   u        w       |u| ≤ 1   w·y
61215   238.3    8.0050            0.1437   0.1437     0.098   0.981   1         7.852
60345   286      7.7211           -0.1402   0.1402    -0.095   0.982   1         7.581
59293   190.8    8.2797            0.4184   0.4184     0.285   0.844   1         6.990
54311   6314     3.1046           -4.7567   4.7567    -3.239   0       0         0
50325   265      7.5691           -0.2921   0.2921    -0.199   0.922   1         6.982
62199   218      8.1564            0.2952   0.2952     0.201   0.921   1         7.511
39000   40252   -0.0456           -7.9069   7.9069    -5.385   0       0         0
61246   239      8.0015            0.1402   0.1402     0.095   0.982   1         7.856

Median of y: m = 7.8613. MAD = 0.2936. SB = Σ w·y / Σ w = 7.949.

Signal Calculation
• First we need to define the idealized mismatch:
$$IM = \begin{cases} MM, & MM < PM \\ \dfrac{PM}{2^{SB}}, & MM \ge PM,\ SB > \text{contrast } \tau \\ \dfrac{PM}{2^{\,\text{contrast } \tau / \left(1 + (\text{contrast } \tau - SB)/\text{scale } \tau\right)}}, & MM \ge PM,\ SB \le \text{contrast } \tau \end{cases}$$
• By default, contrast τ = 0.03 and scale τ = 10.

Signal Calculation Contd.
• V_ij = max(PM_ij - IM_ij, δ)
• Here δ is a small positive constant.
• Signal log value = Tbi(log2(V_ij)); the reported signal is its antilog.
• Keep in mind that here we are SUBTRACTING IM from PM, not taking a ratio as we did for SB.
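The three-case definition of IM and the floor at δ can be written as two small R functions. This is a sketch under stated assumptions: the function names `idealized_mismatch` and `adjusted_pm` are hypothetical, SB is passed in as an already computed number (7.949, from the specific background example), and δ = 2^-20 is an assumed value, since the lecture only says it is a small positive constant.

```r
# Idealized mismatch for one probe set.
# pm, mm: vectors of PM and MM intensities; sb: specific background
idealized_mismatch <- function(pm, mm, sb,
                               contrast_tau = 0.03, scale_tau = 10) {
  if (sb > contrast_tau) {
    # MM >= PM and SB > contrast tau: shrink PM by the typical PM/MM ratio
    fallback <- pm / 2^sb
  } else {
    # MM >= PM and SB <= contrast tau: use a value slightly below PM
    fallback <- pm / 2^(contrast_tau / (1 + (contrast_tau - sb) / scale_tau))
  }
  ifelse(mm < pm, mm, fallback)   # MM itself is used whenever MM < PM
}

# V = max(PM - IM, delta); delta = 2^-20 is an assumption
adjusted_pm <- function(pm, im, delta = 2^-20) pmax(pm - im, delta)

pm <- c(61215, 39000)
mm <- c(238.3, 40252)
im <- idealized_mismatch(pm, mm, sb = 7.949)
im                          # first pair keeps MM, second gets PM / 2^SB
log2(adjusted_pm(pm, im))   # inputs to the Tukey biweight step
```

Because both fallback branches divide PM by a number greater than 1, IM stays below PM in every case, so the log is always defined.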
Example for Signal Calculation
PM        MM      IM       x = log2(PM - IM)   x - med    |x - med|   u         w          w·x
61215     238.3   238.3    15.89597             0.03346   0.03346      0.14667  0.957437   15.21938
60345     286     286      15.87409             0.01158   0.01158      0.05078  0.99485    15.79234
59293     190.8   190.8    15.85092            -0.01158   0.01158     -0.05078  0.99485    15.76929
54310.5   6314    6314     15.55064            -0.31187   0.31187     -1.36701  0          0
50324.8   265     265      15.61136            -0.25114   0.25114     -1.10084  0          0
62199.3   218     218      15.91955             0.05704   0.05704      0.25001  0.878897   13.99165
39000     40252   158.27   15.24532            -0.61719   0.61719     -2.70532  0          0
61246     239     239      15.89669             0.03418   0.03418      0.14982  0.955615   15.19111

Median of x = 15.86251. MAD = 0.045608. Tbi = Σ w·x / Σ w = 15.88652, so 2^Tbi = 60578.7.
Signal = 58611.65, P

Program for Signal Calculation for PM and MM Data
Have the data in a csv file called signal1.csv, with columns PM and MM.

setwd("/myRfolder")
hwdata <- read.table("signal1.csv", header = TRUE, sep = ",", na.strings = " ")

# SB calculation: one-step Tukey biweight of log2(PM/MM)
p1 <- hwdata$PM
m1 <- hwdata$MM
y <- log2(p1 / m1)
md <- median(y)
z <- abs(y - md)
mz <- median(z)                            # MAD
u <- (y - md) / (5 * mz + .0001)
w <- ifelse(abs(u) <= 1, (1 - u^2)^2, 0)   # biweight weights
sb <- sum(y * w) / sum(w)
sb

# signal calculation begins
# idealized mismatch: MM if MM < PM, otherwise a value derived from PM
tau <- 0.03                                # contrast tau; scale tau is 10
if (sb > tau) {
  alt <- p1 / (2^sb)
} else {
  alt <- p1 / 2^(tau / (1 + (tau - sb) / 10))
}
IM <- ifelse(m1 < p1, m1, alt)
i1 <- IM

# one-step Tukey biweight of log2(PM - IM)
ys <- log2(p1 - i1)
ms <- median(ys)
zs <- abs(ys - ms)
ss <- median(zs)
us <- (ys - ms) / (5 * ss + .0001)
ws <- ifelse(abs(us) <= 1, (1 - us^2)^2, 0)
tbs <- sum(ws * ys) / sum(ws)
sgnl <- 2^tbs                              # antilog gives the signal
sgnl

Idealized Mismatch

AFFY: SINGLE CHIP EXPRESSION

AFFY: TREATMENT COMPARISON: TWO CHIPS