Gene Expression Microarrays
Microarray Normalization
Xiaole Shirley Liu
STAT115, STAT215

Microarrays
• Grow cells under a certain condition, collect the mRNA population, and label it
• A microarray has high-density, sequence-specific probes at a known location for each gene/RNA
• The sample is hybridized to the microarray probes by DNA (A-T, G-C) base pairing; nonspecific binding is washed off
• Measure the sample mRNA level by reading the labeled signal at each probe location

Spotted cDNA Arrays
• Pat Brown Lab, Stanford University
• Robotic spotting of cDNA (mRNA converted back to DNA, no introns)
• Several thousand probes / array
• One long probe per gene

Spotted cDNA Arrays
• Competing hybridization
  – Control
  – Treatment
• Detection
  – Green: high control
  – Red: high treatment
  – Yellow: equally high
  – Black: equally low

Why Competing Hybridization?
• DNA concentration differs from probe to probe, and probes are not spotted evenly, so the treatment/control ratio at a spot is more reliable than either absolute intensity

Oligonucleotide Arrays
• GeneChip® by Affymetrix
• Parallel synthesis of oligonucleotide probes (25mer) on a slide using photolithographic methods
• Millions of probes / microarray
• Multiple probes per gene
• One-color arrays

Affymetrix GeneChip Probes

Labeled Samples Hybridize to DNA Probes on GeneChip

Shining Laser Light Causes Tagged Fragments to Glow

Perfect Match (PM) vs MisMatch (MM)
(control for cross hybridization)

Replicates & Normalization
• Replicates: always preferred
• Try to preserve biological variation and minimize experimental variation, so different experiments can be compared
• Assumption: most genes / probes don’t change between two conditions
• Normalization can have a larger effect on the analysis than downstream steps (e.g. group comparisons)

Affymetrix Microarray Image Analysis
• Gridding: based on spike-in DNA
• Affymetrix GeneChip Operating System (GCOS)
  – cel file

      X     Y     MEAN    STDV   NPIXELS
      701   523   311.0   76.5   16
      702   523   48.0    10.5   16

  – cdf file
    • Which probe at (X,Y) corresponds to which probe sequence and targeted transcript
    • The MM probe is always at (X,Y+1) relative to its PM probe

Median Scaling
• Linear scaling
  – Ensure the different arrays have the same median value and the same dynamic range
  – X' = (X - c1) * c2
[Figure: array1 vs array2]

Loess
• LOcally WEighted Scatterplot Smoothing
• Fit a smooth curve
  – Use robust local linear fits
  – Effectively applies different scaling factors at different intensity levels
  – Y = f(X)
  – Transform X to X' = f(X)
  – Y and X' are comparable

Quantile Normalization
• Bolstad et al., Bioinformatics 2003
  – Currently considered the best normalization method
  – Assume most of the probes/genes don’t change between samples
• Calculate the mean for each quantile and reassign each probe the mean of its quantile
• No experiment retains its original values, but all experiments end up with exactly the same distribution
[Figure: probes × experiments matrix with the per-quantile mean]

Dilution Series
• RNA sample in 5 different concentrations
• 5 replicates scanned on 5 different scanners
• Before and after quantile normalization

Normalization Quality Check
• log2R vs log2G plot
  – Values should be on the diagonal
• MvA plot
  – M = log2R - log2G
  – A = (log2R + log2G)/2
  – Values should scatter around 0

Before Normalization
• Pairwise MA plots for 5 arrays, probe level (PM)

After Normalization
• Pairwise MA plots for 5 arrays, probe level (PM)

When Might qnorm Fail?
• Lovén et al., Cell 2012
  – When the assumption that most genes don’t change is violated, e.g. a global shift in total mRNA per cell

Public Microarray Databases
• GEO: Gene Expression Omnibus, an NCBI repository for gene expression and hybridization data, growing quickly
• Oncomine: Cancer Microarray Database
  – Published cancer-related microarrays
  – Raw data all processed, nice interface
• TCGA: The Cancer Genome Atlas
  – https://cghub.ucsc.edu/
  – http://www.cbioportal.org/public-portal/
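
The linear scaling from the Median Scaling slide, X' = (X - c1) * c2, can be sketched as below. The slide does not say how c1 and c2 are chosen, so this sketch assumes the dynamic range is measured by the interquartile range and that one array serves as the reference. The function names `iqr` and `scale_to_reference` are hypothetical.

```python
import statistics


def iqr(values):
    """Interquartile range, used here as a stand-in for 'dynamic range' (assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def scale_to_reference(x, ref):
    """Apply X' = (X - c1) * c2 so that x matches the median and IQR of ref.

    c2 stretches the spread of x to the spread of ref; c1 is then chosen so
    that the median of x lands on the median of ref.
    """
    spread = iqr(x)
    if spread == 0:
        raise ValueError("array has zero interquartile range; cannot rescale")
    c2 = iqr(ref) / spread
    c1 = statistics.median(x) - statistics.median(ref) / c2
    return [(v - c1) * c2 for v in x]


# Toy intensities for two arrays (made-up numbers, for illustration only).
array1 = [120.0, 300.0, 450.0, 800.0, 1500.0]
array2 = [90.0, 210.0, 330.0, 610.0, 1100.0]
array2_scaled = scale_to_reference(array2, array1)
```

After scaling, `array2_scaled` has the same median and interquartile range as `array1`, but the two arrays can still differ at individual intensity levels, which is the gap that loess and quantile normalization address.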
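
The procedure on the Quantile Normalization slide (mean for each quantile, then reassign each probe the mean of its quantile) can be sketched in a few lines. This is a minimal illustration, not the reference implementation from Bolstad et al.: it assumes every array has the same number of probes and no missing values, and it breaks ties by probe position instead of averaging them. The function name `quantile_normalize` is hypothetical.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of probe intensities per experiment)."""
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Sort each experiment, then average across experiments at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(col) / len(arrays) for col in zip(*sorted_arrays)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            # Each probe receives the mean intensity of its rank.
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three toy experiments with four probes each (made-up numbers).
experiments = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(experiments)
```

Sorting any row of `normalized` gives the same list of rank means, which is what "all experiments have exactly the same distribution" means in practice.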
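
The M and A values from the Normalization Quality Check slide follow directly from the two formulas, M = log2R - log2G and A = (log2R + log2G)/2. The sketch below computes them for a pair of arrays; the function name `ma_values` is hypothetical, and it assumes all intensities are strictly positive so that the logarithm is defined.

```python
import math
import statistics


def ma_values(r, g):
    """Return (M, A) lists for paired intensities r and g."""
    m_values, a_values = [], []
    for ri, gi in zip(r, g):
        log_r, log_g = math.log2(ri), math.log2(gi)
        m_values.append(log_r - log_g)        # log ratio
        a_values.append((log_r + log_g) / 2)  # average log intensity
    return m_values, a_values


# Toy intensities (made-up numbers chosen as powers of two).
r = [8.0, 4.0, 16.0]
g = [2.0, 4.0, 16.0]
m, a = ma_values(r, g)
# m is [2.0, 0.0, 0.0] and a is [2.0, 2.0, 4.0]

# A quick numeric stand-in for "values should scatter around 0".
median_m = statistics.median(m)
```

For a well-normalized pair of arrays, the median of M is expected to sit near 0 across the whole range of A; a trend of M against A is the intensity-dependent bias that loess is meant to remove.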