DNA microarrays: “procedure, fabrication, data processing and analysis”
CMMB 461
University of Calgary
Gordon Chua

Suggested Readings
1. Harrington, C.A. et al. (2000) Monitoring gene expression using DNA microarrays. Curr. Opin. Microbiol. 3, 285-291.
2. Hughes, T.R. and Shoemaker, D.D. (2001) DNA microarrays for expression profiling. Curr. Opin. Chem. Biol. 5, 21-25.
3. Quackenbush, J. (2002) Microarray data normalization and transformation. Nat. Genet. 32, Suppl., 496-501.
4. Leung and Cavalieri (2003) Fundamentals of cDNA microarray data analysis. Trends in Genetics 19, 649-659.
5. Eisen, M.B. et al. (1998) Cluster analysis and display of genome-wide expression patterns. PNAS 95, 14863-14868.

1. Background and Introduction to Microarrays: “What are they and what led to their development?”

Transcriptome
• Transcriptome: the complete set of transcripts encoded in the genome and their relative levels of expression in a particular cell or tissue type under defined conditions.
• Characterizing the transcriptome can identify:
  • Genes exhibiting cell- and tissue-specific expression
  • Genes aberrantly expressed in diseased cells and tissues (molecular basis behind the disease)
  • Genes expressed in response to environmental toxins and pharmaceutical compounds (mode of action and side effects)
  • Genes expressed in response to pathogens (mode of infection and virulence)

• Blood transcriptomes were obtained from 104 ASD cases and 82 controls (all males)
• 55 differentially regulated genes were found as candidates for diagnosing autism
• These 55 genes identified ASD with 68% accuracy in males but performed poorly in females
• A blood test to detect autism may be possible.
Kong et al. 2012 PLoS ONE 7: e49475

Northern blotting
• Conventional method to detect RNA transcripts in a cell or tissue.
• To characterize the transcriptome of a human cell or tissue type, you would have to run 25,000 northerns and use 25,000 different probes!
(Figure source: Wikipedia)

DNA microarrays
• A DNA/gene chip contains single-stranded probes (25-70 nucleotides) with sequence complementary to a specific gene/mRNA
• Each probe is present in many copies in a spot on the microarray
• Fluorescently labelled mRNAs or cDNAs are placed on the microarray to hybridize (complementary base pairing) to the probes
• The intensity of the fluorescence is proportional to the abundance of mRNA/cDNA bound to the probe
• Allows simultaneous monitoring of the expression (mRNA) level of every gene in an organism in response to genetic and environmental perturbation
• A single experiment (two weeks) can determine which genes in the genome are transcriptionally turned on or off

Microarray probe design
[Figure: side view and top view of oligonucleotide probes on the array surface]
1. Specificity: unique for each gene, no cross-hybridization
2. Homogeneity: all probes bind their complementary DNA at the same Tm
3. Sensitivity: probes do not form secondary (2°) structures that interfere with hybridization

Microarray procedure
[Figure: wild type (control) versus mutant/drug (experimental) samples containing transcripts X, Y and Z]
• Isolate total mRNA
• Reverse transcribe and label cDNA with red (Cy5) and green (Cy3) fluorescent dyes
• Relative levels of X, Y and Z; spot legend: UP, DOWN, UNCHANGED, NOT PRESENT

2. Fabrication of Microarrays: “How do they get oligonucleotide probes on a matrix at such high densities?”

Ink-jet microarrays (Agilent)
• An ink-jet print-head uniformly deposits small, accurate volumes (picoliters) of nucleic acids, building the 60-mer oligonucleotide probes one base at a time onto a 1" × 3" glass slide
[Figure: 4 × 44K Expression microarray]
• Flexible, customizable
• Virtually all 60-mer probes are functional
• No need for expensive masks: cheaper
• Density: >1,000,000 spots/array
http://www.agilent.com/about/newsroom/lsca/background/2007/bg_microarrays.pdf

Photolithographic microarrays (Affymetrix)
• Oligonucleotide probes are synthesized on a wafer using a combination of photolithography and chemistry
• Photomask: opaque plate with holes that allow light to shine on specific locations on the silicon wafer
• Light removes the blocking compound that prevents base addition to the wafer
• Flood with a chemical base (e.g. adenine), which attaches to the unblocked areas of the wafer
• Repeat this process with new blocking compound and photomask
http://www.affymetrix.com/estore/about_affymetrix/outreach/educator/microarray_img_resources.affx#

3. Labelling, Scanning and Image Processing: “Getting colourful microarray images and extracting the data”

Common ways to “label” nucleic acids
• Random priming of double-stranded DNA: reaction contains labelled nucleotides
• Direct labelling of mRNA with fluorescent molecules
• Poly-T primed cDNA synthesis (reverse transcription): reaction contains labelled nucleotides
• Amplification by transcription: poly-T primer carrying a T7 promoter, “second strand” synthesis, then a T7 reaction containing labelled nucleotides
Courtesy of Tim Hughes

Fluorescence dyes for labelling microarray samples (Cy3 and Cy5)
• Fluorescence: emission of light by a molecule that has absorbed light/radiation (excitation)
• Cy3 and Cy5 are water-soluble fluorescent dyes of the cyanine family
• Cy5 dye is excited with a 635 nm red laser and detected by an emission filter that passes only 650-690 nm light
• Cy3 dye is excited with a 532 nm green laser and detected by an emission filter that passes only 557-592 nm light
• Fluorescence intensity is detected by a photomultiplier tube
[Figure: excitation and emission wavelengths of Cy3 and Cy5]
http://www.answers.com/topic/cy3-cy5-dyes-gif-1
http://www.jireurope.com/technical/images/GRAPH1.gif

What does microarray data initially look like?
• For each microarray, acquire two 16-bit TIFF images, one scanned in the Cy5 (red) channel and one in the Cy3 (green) channel
[Figure: Cy5 channel, Cy3 channel and merged images from a red and green laser scanner (Genepix)]
https://www.youtube.com/watch?v=VNsThMNjKhM

Image Segmentation
• Partition the image to determine which pixels constitute signal or background
• Spatial segmentation: use an inner circle to calculate the signal value and the pixels outside an outer circle as local background
• Problem: sometimes the inner circle is not small enough for tiny spots
• Intensity-based segmentation: rank the pixels by intensity and take as signal the top-ranked pixels that cover approximately the area of the spot
• The two types of segmentation can be combined
• Background can also be estimated from blank spots or control spots of exogenous DNA

Quantification of signal and background
• Mean, median, mode and total intensity of the segmented signal (microarray spot) and background pixels are reported in a text file (e.g. gpr file)
• Signal intensity = total spot intensity - background intensity
• The median is usually used because it is more robust to outliers
[Figure: gpr file columns showing spot location on microarray, genes, Cy5 and Cy3 values]

4. Microarray Data Pre-processing and Normalization: “Correct the data first before spending all your time analyzing it”

Log transformation of expression ratios
• When comparing relative abundance of gene expression between two samples, take the ratio of Cy5/Cy3 values (R/G)
• Take the log of the expression ratios, log(R/G): this makes the distribution of the data more symmetrical, so upregulated and downregulated genes are treated equally
• A 2-fold change gives R/G = 2 or 0.5, while log2(R/G) = +1 or -1
[Figure: distributions of R/G and log(R/G)]
http://www.bio.davidson.edu/people/macampbell/ACS_MAGIC/transform.html

Microarray Data Normalization
Required to correct for variations caused by:
• Unequal amounts of cDNA
• Distinct dye properties (fluorescence/quenching)
• Differences in dye incorporation
• Differences in scanning

Within-array/single-experiment normalization
[Figure: scatterplot of log intensity Cy5 (experimental) versus log intensity Cy3 (control)]
• Assumption: most genes are not differentially regulated
• The graph looks as if the vast majority of genes (spots) are upregulated in the experiment
• More likely, more Cy5-labelled cDNA was used in the hybridization or the dyes have non-linear properties

Global linear normalization
• Assume equal quantities of cDNA and equal total intensity of Cy3 and Cy5
• Normalization constant = Σ(Cy3)/Σ(Cy5) [e.g. 10,000/20,000 = 0.5]
• For each gene, multiply the Cy5 intensity by the normalization constant (here 0.5) so that the overall Cy5/Cy3 ratio becomes 1
• Only works partially because the relationship is not linear
[Figure: two scatterplots of log intensity Cy3 versus log intensity Cy5]

Scatterplot versus M/A (R/I) plot
[Figure: scatterplot of log2(R) versus log2(G) next to an M/A plot]
• M = log2(R/G): log expression ratio
• A = (½)*log2(R*G): intensity (brightness) of microarray spots
• The M/A plot allows detection of intensity-dependent effects on log expression ratios
• The plots above show that most of the greener spots are low-intensity spots
http://compbio.pbworks.com/w/page/16252907/Microarray%20Normalization%20and%20Gene%20Expression%20Index

Global Lowess (locally weighted linear regression)
• Performs a series of local regressions in overlapping windows with a weighted average of neighbouring spots (curve fitting and correction)
• The local regressions are combined to make the smooth Lowess curve (weighted average values: closer spots have greater weight than far-away spots)
[Figure: sliding window; the fitted line is a function of mean intensity]
• Normalized log(R/G) = log(R/G) - Lowess correction
• Lowess correction: for each spot, subtract from its log ratio the deviation of the Lowess curve from the zero axis at that spot's intensity
• As a result, log ratios at all intensities have a mean of 0

5. Spot and replicate filtering: “Improving the quality of data”

Filtering out low intensity spots
• The normalized log ratios of low-intensity spots show greater variation and are less reliable for identifying differentially expressed genes
• Use some arbitrary cut-off for low-intensity spots
[Figure: self-hybridization experiment]

Replicate filtering
• Plot the normalized log ratios from two replicate experiments
• Blue spots agree within two standard deviations between both replicates and are kept, while brown spots (> 2 SD) are removed
Quackenbush (2002) Nature Genetics Suppl. 32

Dye-reversal replicates
• Uneven incorporation of Cy3 and Cy5 dyes can cause false positives for differentially expressed genes
• If a common mRNA sample is labelled with both Cy3 and Cy5 and hybridized on a microarray, all spots should have a ratio of 1 (log ratio of 0)
• Dye bias: spots are not at 0
• Solution: dye reversal/swap experiment
  1. Sample A-Cy3 vs. Sample B-Cy5
  2. Sample B’-Cy3 vs. Sample A’-Cy5
• Merged normalized log ratio = [log2(A/B) + log2(A’/B’)]/2, i.e. the mean of the two sample A versus sample B log ratios, in which the dye bias cancels
Dabney and Storey (2007) 8:129-138

“Now that your data is normalized, what do you do next?”
• Identify differentially expressed genes
• Cluster analysis

Identifying differentially expressed genes
• The most straightforward way is to use a fixed fold-change cut-off (usually two-fold)
[Figure: log ratio plots with the 2-fold cut-off marked]
• Problem: the variability of the log ratio is greater at lower intensities
• At lower-intensity spots, genes can be misidentified as differentially expressed
• At higher-intensity spots, differentially expressed genes can be missed

Z-score transformation
• Measures the number of standard deviations a particular data point is from the mean/median
• Using a sliding window, calculate the local mean and standard deviation within a window surrounding each data point (e.g. within 0.25 log units of spot i, a window 0.5 log units wide)
• Zi = (ri - mi)/si, where ri is the log ratio of spot i and mi and si are the local mean and standard deviation, respectively
• Spots with |Zi| > 1.96 are differentially regulated at the 95% confidence level
Quackenbush (2002) Nature Genetics Suppl. 32

Learning objectives: you should be able to…
• Explain how microarrays are used to study transcriptomes
• Describe how RNA samples are prepared for microarray experiments
• Describe the characteristics of microarray probes
• Explain why it is important to normalize microarray data
• Describe the two approaches to identify differentially expressed genes
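
Worked sketch of the numerical steps
The arithmetic behind global linear normalization, the M and A values, dye-swap merging and the sliding-window Z-score can be tried out with a few lines of Python. This is only a minimal sketch under stated assumptions, not the method of any particular software package: the function names are hypothetical, the intensities are made up, each array's log ratio is taken as M = log2(Cy5/Cy3), and the Z-score window is measured along A in log2 units (the slides do not specify the axis or the log base).

```python
import math
from statistics import mean, stdev


def global_linear_normalize(cy5, cy3):
    """Scale Cy5 by sum(Cy3)/sum(Cy5) so both channels have equal total intensity."""
    constant = sum(cy3) / sum(cy5)
    return [r * constant for r in cy5]


def ma_values(cy5, cy3):
    """Return per-spot M = log2(R/G) and A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(cy5, cy3)]
    a = [0.5 * math.log2(r * g) for r, g in zip(cy5, cy3)]
    return m, a


def merge_dye_swap(m_forward, m_swapped):
    """Combine log2(Cy5/Cy3) values from the two arrays of a dye-swap pair.

    Swapping the dyes flips the sign of the biological log ratio but not of
    the dye bias, so half the difference keeps the signal and cancels the bias.
    """
    return [(f - s) / 2 for f, s in zip(m_forward, m_swapped)]


def local_z_scores(m, a, half_window=0.25, min_spots=3):
    """Z_i = (M_i - local mean) / local SD over spots with |A_j - A_i| <= half_window."""
    z = []
    for m_i, a_i in zip(m, a):
        window = [m_j for m_j, a_j in zip(m, a) if abs(a_j - a_i) <= half_window]
        if len(window) < min_spots:
            z.append(float("nan"))  # too few neighbours to estimate a local SD
            continue
        sd = stdev(window)
        z.append((m_i - mean(window)) / sd if sd > 0 else 0.0)
    return z


if __name__ == "__main__":
    # Made-up background-subtracted intensities for five spots (not real data).
    cy5 = [220.0, 410.0, 1900.0, 95.0, 640.0]
    cy3 = [100.0, 210.0, 400.0, 50.0, 310.0]
    cy5_norm = global_linear_normalize(cy5, cy3)
    m, a = ma_values(cy5_norm, cy3)
    # A very wide window is used here only because the toy data set is tiny.
    z = local_z_scores(m, a, half_window=5.0)
    for m_i, a_i, z_i in zip(m, a, z):
        flag = "differentially expressed" if abs(z_i) > 1.96 else "unchanged"
        print(f"M={m_i:+.2f}  A={a_i:.2f}  Z={z_i:+.2f}  {flag}")
```

On a real array with thousands of spots, the default half-window of 0.25 would give each spot many neighbours. A Lowess correction would normally be applied to M before the Z-scores are calculated, and that step is left out of this sketch.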