Basic Principles in Bioinformatics: Understanding Microarrays
Pierre Farmer / Pascale Anderle
Swiss Institute for Bioinformatics / ISREC

Aim of This Course
• Rapid overview of microarray technologies
• Introduction to different bioinformatic solutions related to microarrays

Overview of the Course
Part I
• Introduction to microarray technology
• Illustration of typical biological questions related to microarray studies
• Short presentation of methods used for the analysis of microarray data
Part II (TP)
• Discussion of biological questions and how they could be answered by applying microarray data mining
Part III
• Functional classification

Biological Problem
• What is the difference between a tumor and healthy tissue?
• Are there different types of tumors?

Biological Fundamentals
• Transcriptome: Genes
• Proteome: Proteins
• Microarrays

Genomics Fundamentals
• Genomic DNA (ATGC): Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3
• Transcription → Messenger RNA (AUGC): Exon 1, Exon 2, Exon 3
• Reverse transcription (RT-PCR) → cDNA (ATGC): Exon 1, Exon 2, Exon 3
• PCR → cDNA/PCR product (ATGC): Exon 2, Exon 3

Introduction into Microarray Technology: Sample
[Figure: tumor tissue vs. normal tissue, shown at the protein, mRNA (……CCAGGCAAUAAAAAA; ……AUGAGUAAUAAAAAA) and corresponding DNA (……CCAGGCAATAAAAAA; ……ATGAGTAATAAAAAA) levels. Signal comparisons shown: Signal N < Signal T and Signal N << Signal T.]

Introduction into Microarray Technology
[Figure: Normal vs. Tumor for Gene A, Gene B and Gene C.]

Introduction into Microarray Technology
• Probes: oligomers or PCR products
• Chip production: spotting (printing) or photolithography
• Physical support: glass slide, nylon membrane
• Sample preparation and hybridization: cRNA or cDNA; single-labeling or dual-labeling; fluorescence or radioactivity
• Affymetrix: short oligo chip, single labeling
• cDNA chip: oligos or PCR products, dual-labeling

Microarray: Definition
• Microarray analysis is a technology that allows the simultaneous detection of
the expression of thousands of genes in a small sample.
• Microarrays are simply ordered sets of DNA molecules of known sequence fixed on a physical support.

Different Microarray Platforms
• Definition of biological questions
• Experimental design
• Custom array: PCR products or oligomers
• Commercial array: short oligos (Affymetrix) or long oligos (Agilent)
• Chip preparation: probe design, probe preparation, printing
• Sample preparation: cRNA/cDNA, labeling
• Hybridization
• Scanning
• Data acquisition and data analysis

Making the Chip: Probe Design
Choosing genes of interest for the experiment.
The probe selection strategy should ensure: everything! Or...
• Biologically meaningful results (the truth...)
• Coverage, sensitivity (... the whole truth...)
• Specificity (... and nothing but the truth)
• Annotation

Making the Chip: Probe Design
Coding region (ORF)
• Annotation relatively safe
• No problems with alternative polyA sites
• No repetitive elements or other funny sequences
• Danger of close isoforms
• Danger of alternative splicing
3' untranslated region
• Annotation less safe
• Danger of alternative polyA sites
• Danger of repetitive elements
• Less likely to cross-hybridize with isoforms
• Little danger of alternative splicing
5' untranslated region
• Close linkage to promoter
• Frequently not available
[Figure labels: 5'UTR, EXON A, EXON B, INTRON, 3'UTR]

Probe Design for Custom Array
[Figure: probe-design flowchart (Brown et al., AAPS PharmSci 2003; Anderle et al., Pharm Res 2003). Labels recoverable from the flowchart:]
• Keywords, seed sequences; search Pfam HMM db; HMM models; run hmmsearch against GenPept db; putative new genes; filter genes (human only, set cut-off, eliminate redundant genes)
• Protein seed sequence; converged PSI-Blast; core protein family; Blast human EST db; EST nucleotide sequence; remove vector and characterized ESTs; assemble contigs; 236 contigs and singlets
• Multiple alignment and selection of representative genes; run Pick70 (Tm = 70, Palindrome Uniqueness = 15 bp)
• Counts shown, in order of appearance: Transporters: 670, Channels: 263; Transporters: 316, Channels: 151; Contigs: 156; Positive Controls: 9; Negative Controls: 3; Controls (diff. oligos): 9; RGS: 75; FGF/RGF-like: 7; ADAM family: 18

THE EXPERIMENT: Printing I
The microspotting is done by a robot called an "arrayer".

THE EXPERIMENT: Printing II
Microspotting

THE EXPERIMENT: Printing III
Oligo-spotting (photolithography)

Summary?

Microarray Analysis: Data Analysis
• Definition of biological questions
• Experimental design
Low level analysis
• Scanning and processing of images
• Calculation of expression values per probe set
• Normalization across chips
High level analysis
• Statistical analysis of expression values
• Clustering of expression values
• Annotation of probe sets
• Functional classification
• Biological interpretation of data

Data Analysis: Processing of Images II
• Addressing or gridding: assigning coordinates to each of the spots
• Segmentation: classification of pixels either as foreground or as background
• Intensity extraction (for each spot): foreground fluorescence intensity pairs (R, G), background intensities, quality measures

Fluorescence Signal to Expression Level
• PM (perfect match) probe: GTTAAGCGTTCCGATGCTACTTACC
• MM (mismatch) probe: GTTAAGCGTTCCCATGCTACTTACC
[Figure labels: probes; mRNA reference sequence = representative sequence; consensus sequence]

Fluorescence Signal to Expression Level I
Example: Affymetrix
• In ~30 % of cases the MM signal > PM signal
• Probes of a given set mapped to different UniGene clusters
• Same probe mapped to different UniGene clusters
• Ca. 340 MM mapped to UniGene clusters

Computing Expression Values
Microarray Analysis Suite (MAS 5.0):
$\text{signal} = \mathrm{TukeyBiweight}\{\log(PM_j - MM^*_j)\}$
where $MM^*$ is a version of MM that is never bigger than PM, and the Tukey biweight is a type of robust estimator.
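To make the "robust estimator" concrete, here is a minimal sketch of a one-step Tukey biweight average applied to log-transformed probe differences. The tuning constants `c = 5` and `eps = 0.0001` are assumptions taken from the commonly described MAS 5.0 settings, not from these slides, and the function name and sample intensities are purely illustrative. How MM* is derived is not covered here; the sketch assumes PM and MM* are already given with PM > MM*.

```python
from math import log2
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight average (c and eps are assumed defaults)."""
    m = median(values)
    # Median absolute deviation (MAD) as a robust measure of spread.
    s = median(abs(v - m) for v in values)
    num = den = 0.0
    for v in values:
        u = (v - m) / (c * s + eps)
        # Bisquare weight: points far from the median get weight 0.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        num += w * v
        den += w
    return num / den


# Hypothetical PM and MM* intensities for one probe set (illustration only).
pm = [1200.0, 1350.0, 1100.0, 9000.0, 1250.0]
mm_star = [300.0, 320.0, 280.0, 400.0, 310.0]
log_diffs = [log2(p - m) for p, m in zip(pm, mm_star)]
signal_log2 = tukey_biweight(log_diffs)
print(round(signal_log2, 3))
```

The probe with the unusually high PM value receives zero weight, so the resulting signal stays close to the median of the remaining probes, which is the behavior the slide alludes to.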
Li and Wong model:
$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N(0, \sigma^2)$
where $\theta_i$ is the gene expression in chip $i$ and $\phi_j$ is the rate of increase of the PM response over MM (probe-specific effect).

Robust multi-chip analysis (RMA):
$\log(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$
Uses only PM and ignores MM, assumes an additive model (on the log scale), and estimates chip effects $a_i$ and probe effects $b_j$ using a robust method (median polish).

MAS 5 vs. RMA: A Values
MAS 5 vs. RMA: M vs. A Plot
[Figure: A values and M vs. A plots for RMA and MAS 5.]

Data Analysis: Transformation (Coding)
[Figure: effect of different data transformation schemes (log2 transformation vs. no transformation) on the total distribution of expression values. Data: Alon et al. PNAS 1999.]
Ratio (untransformed)    Log2 ratio
2                        1
1                        0
0.5                      -1
On the untransformed scale, a 2-fold increase and a 2-fold decrease sit at unequal distances from 1 (1 vs. 0.5); after log2 transformation both sit at distance 1 from 0.
$2^x = y \Leftrightarrow \log_2(y) = x$; for example $2^2 = 4$ and $\log_2(4) = 2$.

Data Analysis: Normalization I
Tentative separation of systematic sources of variation ("artefacts") that bias the results from random sources of variation ("noise") that hide the truth.
Typical statistical approach:
• Measured value = real value + systematic errors + noise
• Corrected value = real value + noise
• Analysis of corrected value => (unbiased) CONCLUSIONS
Examples of systematic errors:
• Fluorochrome incorporation
• Spatial bias

Data Analysis: Normalization II
[Figure: scatter (MVA) plots of non-normalized data, for a self-self hybridization and for a hybridization that is not self-self.]

Normalization: global
Normalization based on a global adjustment:
$\log_2(R/G) \rightarrow \log_2(R/G) - c = \log_2\big(R/(kG)\big)$
Common choices for $k$ or $c = \log_2 k$ are $c$ = median or mean of the log ratios for a particular gene set (e.g. all genes, or control or housekeeping genes).
Another possibility is total intensity normalization, where $k = \sum_i R_i / \sum_i G_i$.
[Figure: median-centering normalization, shown on the ratio and log2 ratio scales.]

Data Analysis: Normalization III
Methods:
• Median center: MEDIAN log2(CY3/CY5) = 0
• Linear transformation
[Figure: CY3 vs. CY5 scatter plots.]
Why is this not satisfactory? More noise with low-expressed genes.

Data Analysis: Use of M vs A Plot
1. Logs stretch out the region we are most interested in.
2. Features of the data such as intensity-dependent variation and dye bias can be seen more clearly.
3. Differentially expressed genes are more easily identified.
4. Intuitive interpretation.
$M = \log(R/G) = \log R - \log G$
$A = (\log R + \log G)/2$

Data Analysis: Normalization IV
[Figure: M vs. A plots before and after Loess correction, with magnification.]

Data Analysis: Normalization IV
[Figure: M vs. A plots per array and per sub-array, showing regional variation (spatial bias).]

Data Analysis: Normalization V
Use of spikes
[Figure: before normalization vs. after normalization.]

Data Analysis: Low Level Analysis
Summary:
• Chip has been built!
• Signals have been measured!
• Systematic errors have been removed!

Data Analysis: Limitations
Problems in data analysis
• Limitations of traditional biological interpretations: complexity (10 000 genes)
• How to distinguish a true positive result from a false positive?
Methods:
1. Supervised learning: k-nearest neighbor, LDA
2. Non-supervised learning: clustering

Data Analysis: Clustering Objectives
Unsupervised learning
• Gene discovery / class identification
• Looking for characterization of the components of the data set, without a priori input on cases or genes
• Labels are not used
• Methods: hierarchical clustering/dendrograms, K-means clustering, self-organizing maps (SOM)
Supervised learning
• Sample/gene classification
• Finding genes, combinations of genes or samples that match a particular a priori pattern
• Labels are used
• Methods: LDA, k-NN classification

Supervised vs. Unsupervised Learning: Examples
1. Example: Identification of genes that are responsible for the fact that some patients respond differently to a certain type of chemotherapy
2. Example: Identification of genes or groups of genes that explain the difference between tumor tissue and non-tumor tissue based on the expression profile of ~100 samples (60 tumor tissue / 40 healthy tissue)
3.
Example: Identify a group of genes that are co-regulated upon a given treatment

Unsupervised Learning: Problems
[Figure: objects plotted by "circularity of spots" and "length of neck".]

Unsupervised Methods
• Similar objects are grouped together: this is clustering!
• How do we measure similarity?

Agglomerative Hierarchical Clustering I
Before doing such clustering, one has to define two things:
1- The similarity measure between two genes (or experiments)
• Correlation: Distance = 1 - R
• Euclidean: Distance = $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
2- The distance measure between the new cluster and the others
• Single linkage: distance between the closest pair
• Complete linkage: distance between the farthest pair
• Average linkage: average of all pairwise distances between the two clusters (centroid linkage, by contrast, uses the distance between cluster centers)

Agglomerative Hierarchical Clustering II
[Figure: five points plotted as Gene 1 vs. Gene 2 and the corresponding dendrogram; the vertical axis is the distance between joined clusters.]
The dendrogram induces a linear ordering of the data points.

Clustering: Defining Clusters

Unsupervised Clustering: Example
Sorlie et al. Proc Natl Acad Sci U S A 2001 Sep 11;98(19):10869-74

Supervised Methods: Learning Problems
Which criteria should we use?

Supervised Methods: Examples
[Figure: k-nearest neighbor (k-NN), PCA and LDA illustrated on a sample data matrix; axes Gene 1 and Gene 2, with an unknown sample marked "?".]

Supervised Learning: Problems

Supervised Methods: Cross-Validation
[Figure: microarray data (tissues x genes) with labels are split into a training set and a test set; an LDA predictor built on the training subset yields predicted labels for the test subset, which are evaluated against the test set labels.]

Supervised Methods: Experimental Design
[Figure: Subsets 1 to 4; in each round a model is trained on the learning set and cross-validated on the test set.]
Characteristics:
• Test set: 15 tissues
• Training set: 45 tissues
The 4 subsets are used for cross-validation (data set from Alon et al. 1999).
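The four-subset design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_folds` is hypothetical, and the 40 tumor / 20 normal label split is invented here to reach 60 samples (15 test + 45 training); it is not taken from the Alon et al. data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin into the k folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 40 tumor and 20 normal tissues (assumption).
labels = ["tumor"] * 40 + ["normal"] * 20
folds = stratified_folds(labels, k=4)

for test_fold in folds:
    # Each fold serves once as the test set; the rest is the training set.
    train = [i for fold in folds if fold is not test_fold for i in fold]
    n_normal = sum(labels[i] == "normal" for i in test_fold)
    print(len(test_fold), len(train), n_normal)
```

Each round uses one subset of 15 tissues for testing and the remaining 45 for training, so a classifier such as LDA would be fitted four times and evaluated on data it has not seen.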
• Always the same proportions of normal / cancer tissues in each subset
• All data are in the test set once (and only once)

Supervised Methods: Student's Test → LDA I
• Group A vs. Group B: compute the t statistic for all genes
• LDA is done with the most "differently expressed" gene, then with the most and the second most, etc. (cumulative)

Supervised Methods: Student's Test → LDA II
[Figure: effect of the number of genes selected with a Student's t-test on the LDA performance. x-axis: number of genes (cumulative), 0 to 40; y-axis: percent of correct predictions; series: test set and training set; annotated point: (12, 89).]

Summary: Part I
Microarray analysis allows the simultaneous detection of the expression of thousands of genes in a small sample.
A microarray experiment includes:
- Experimental design
- Making of the chip
- Preparation of samples, hybridization, detection of fluorescence signals
- Low level analysis:
  - Transformation of fluorescence signal measurements into expression level values
  - Normalization
- High level analysis:
  - Clustering, statistical analysis, functional classification

Part II: Practical Course
1. Exercise: In which steps of a typical microarray experiment may optimized computational methods contribute to an improvement?
2. Exercise: What features would you include in a probe design program?
3. Exercise: Which methods do you think the authors applied to answer the questions described in their abstracts?
4. Exercise: What are the principal objectives of a supervised and of an unsupervised learning method, respectively?
5. Exercise: What do you think are the major limitations of microarrays?
6. Exercise: When would you rather use RMA or MAS5, respectively?
7. Exercise: Why is normalization crucial for the analysis of microarray data?
8. Exercise: How can you relate microarray data and phenotypes?

Part II: Abstract A
Novel genes and functional relationships in the adult mouse gastrointestinal tract identified by microarray analysis.
Bates MD, Erwin CR, Sanford LP, Wiginton D, Bezerra JA, Schatzman LC, Jegga AG, Ley-Ebert C, Williams SS, Steinbrecher KA, Warner BW, Cohen MB, Aronow BJ. Division of Gastroenterology, Children's Hospital Medical Center, University of Cincinnati College of Medicine, Cincinnati, Ohio 45229, USA. michael.bates@chmcc.org BACKGROUND & AIMS: A genome-level understanding of the molecular basis of segmental gene expression along the anterior-posterior (A-P) axis of the mammalian gastrointestinal (GI) tract is lacking. We hypothesized that functional patterning along the A-P axis of the GI tract could be defined at the molecular level by analyzing expression profiles of large numbers of genes. METHODS: Incyte GEM1 microarrays containing 8638 complementary DNAs (cDNAs) were used to define expression profiles in adult mouse stomach, duodenum, jejunum, ileum, cecum, proximal colon, and distal colon. Highly expressed cDNAs were classified based on segmental expression patterns and protein function. RESULTS: 571 cDNAs were expressed 2-fold higher than reference in at least 1 GI tissue. Most of these genes displayed sharp segmental expression boundaries, the majority of which were at anatomically defined locations. Boundaries were particularly striking for genes encoding proteins that function in intermediary metabolism, transport, and cell-cell communication. Genes with distinctive expression profiles were compared with mouse and human genomic sequence for promoter analysis and gene discovery. CONCLUSIONS: The anatomically defined organs of the GI tract (stomach, small intestine, colon) can be distinguished based on a genome-level analysis of gene expression profiles. However, distinctions between various regions of the small intestine and colon are much less striking. We have identified novel genes not previously known to be expressed in the adult GI tract. 
Identification of genes coordinately regulated along the A-P axis provides a basis for new insights and gene discovery relevant to GI development, differentiation, function, and disease.
Gastroenterology 2002 May;122(5):1467-82

Part II: Abstract B
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.
Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA. golub@genome.wi.mit.edu
Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge.
Science 1999 Oct 15;286(5439):531-7

Functional Classification
Microarray Data Analysis Workflow:
Existing data (repository) (1) -> generate data (2) -> collect & manage data (3) (microarray data management systems) -> analyze interesting sequences (4) -> deposit into repositories (5)

Functional Classification
Typical questions to be answered with functional classification:
• Whether a gene has a known function, and if so, in what class?
• Whether genes found to cluster together have been described as being functionally similar or related (promoter motifs, transcription factors)
• Whether homologs or orthologs have been found to be functionally related in any known physiological or pathological state
• Whether the resultant genes are known to be associated with the experimental conditions tested.

Functional Classification
• Grouping and clustering
• Transcriptional "signatures": identification of common promoter elements and regulatory networks
• GO (Gene Ontology): gene product description by biological process, cellular component and molecular function
• Chromosomal location
• Metabolic pathway assignment
[Table: example genes, with columns T1 and T2 and a metabolic pathway assignment marked by "x".]
Name                                   Gene ID
Cytochrome p450 subfamily 4            Cyp 4A
HMG CoA synthase                       HMG-CoA Syn
Apolipoprotein CIII                    Apo CIII
Stearoyl-coenzyme desaturase           SCOD-1
Carnitine palmitoyl transferase-1      CPT-1
Fatty acid binding protein             FABP
Phosphoenolpyruvate carboxykinase      PEPCK
Cluster determinant 36                 CD36

Chromosomal Location: Annotation Affymetrix
[Figure: annotation flowchart. Labels: representative sequence; consensus sequence; probes; BLAT against assembly sequence from UCSC; comparison with UG DB; NetAffx; UniGene; Ensembl DB; EnsMart DB; Tagger; exact mapping to UG and RefSeq DB; exact mapping to temp cDNA DB; SIB annotation, 4 quality levels.]
Representative sequence: chosen during chip design as a sequence which is best associated with the transcribed region being interrogated.
BLAT threshold:
• Only records whose match / Qsize >= 75%
• Only records whose score >= 0.70, where score = (match - mismatch - gap# x 5 - gap_size x 2) / Qsize
• If a record has several mapping locations with score > 0.70, the highest one is chosen; if a record has several mapping locations with the same highest score, all mapping locations are kept.
EnsMart approach: cDNA sequence plus an additional length of downstream sequence immediately following the most 3' exon. The individual probe sequences are mapped by exact matching. If more than 50 % of the probes map, the probe set is listed as a hit.

Comparison of Various Annotations
[Figure: Venn diagrams comparing the NetAffx, EnsMart and Tagger annotations for the Mouse MOE A and B chips and the Human U133 A and B chips.]

Quality of Probe Sets
Mapped on: RefSeqs
Chip        High     Medium   Low      Undefined
HG-133A     13792    1663     1103     5657
HG-133B     3795     790      519      17473
Mu74v2A     5340     1283     1697     4102
Mu74v2B     2587     969      1190     7665
Mu74v2C     756      302      982      9828
MOE-A       12683    2395     1194     6354
MOE-B       2453     620      592      18846

Mapped on: RefSeqs, mRNAs, ESTs, HTCs
Chip        High     Medium   Low      Undefined
HG-133A     15703    1196     3983     1333
HG-133B     10096    2026     3125     7330
Mu74v2A     8015     615      2127     1665
Mu74v2B     7010     1421     2306     1674
Mu74v2C     2600     780      2555     5933
MOE-A       18070    1222     2383     951
MOE-B       11602    2376     2478     6055

Distribution: UGs per Probe Set
[Figure: number of probe sets (log scale, 1 to 100000) vs. number of UniGenes (log scale, 1 to 100). Series: EnsMart A, EnsMart B, Tagger A, Tagger B, NetAffx A, NetAffx B.]

Gene Ontology Project
[Figure: GO output for ABCB1, with GO terms (GO:X, GO:Y, GO:Z) at levels L2 to L4 under Cellular Component, Molecular Function and Biological Process.]
Two pragmatic purposes of ontology:
1. Facilitate communication between people and organizations
2. Improve interoperability between systems
Ontologies are structured vocabularies in the form of directed acyclic graphs (DAGs) that represent a network in which each term may be a "child" of one or more than one "parent".

Distribution: Probe Sets per UG
[Figure: number of UniGenes (log scale, 1 to 100000) vs. number of probe sets (log scale, 1 to 100). Series: U133A, U133B, U133AB, U74Av2, U74Bv2, U74Cv2, U74ABCv2, U74ABCv3_NA, MOE430A, MOE430B, MOE430AB.]

MAPPFinder - GenMAPP
Doniger et al. Genome Biology 2003
http://www.genmapp.org/
GenMAPP: Gene Microarray Pathway Profiler

KEGG: Kyoto Encyclopedia of Genes and Genomes
The 3 main goals of the KEGG project:
1. Computerizing the current knowledge of genetics, biochemistry, and molecular and cellular biology in terms of the pathways of interacting molecules or genes
2. Collection of gene catalogs for all organisms with completely sequenced genomes and selected organisms with partial genomes (consistent annotation)
3. Catalog of chemical elements, compounds and other substances in living cells
Summary of KEGG release 8.0
Kanehisa et al. 2002, Nucleic Acids Research; Ogata et al. 1999, Nucleic Acids Research; http://www.genome.ad.jp/kegg/

Signaling Pathways
Similar to other nuclear hormone receptors, PPAR acts as a ligand-activated transcription factor. Upon binding fatty acids or hypolipidemic drugs, PPARα interacts with RXR and regulates the expression of target genes. These genes are involved in the catabolism of fatty acids. Conversely, PPARγ is activated by prostaglandins, leukotrienes and anti-diabetic thiazolidinediones and affects the expression of genes involved in the storage of fatty acids. PPARβ is only weakly activated by fatty acids, prostaglandins and leukotrienes and has no known physiologically relevant ligand. However, data from PPARβ null mice suggest that PPARβ does serve a role in fatty acid metabolism and perhaps in skin proliferation and cancer.
Genetic Network Models: Goals
• Must incorporate rule-based dependencies between genes: rule-based dependencies may constitute important biological information.
• Must allow the systematic study of global network dynamics: in particular, individual gene effects on long-run network behavior.
• Must be able to cope with uncertainty: small sample size, noisy measurements, robustness.
• Must permit quantification of the relative influence and sensitivity of genes in their interactions with other genes: this allows us to focus on individual (groups of) genes.

Microarray and Data Repositories
Name Archival Treatment Visualization Acuity dual-color cDNA/oligo dual-color cDNA/oligo ArrayDB dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo. Dendrograms, 2-D interactive plots, animated interactive 3-D plots, line graphs, scatter plots. dual-color cDNA/oligo ArrayInformatics dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo, Affymetrix, Scatter, line and series plots and a cluster image map,. is not supporting XML as of yet. BASE dual-color cDNA/oligo, Affymetrix, SAGE Affymetrix dual-color cDNA/oligo, Affymetrix, SAGE Expressionist dual-color cDNA/oligo, Affymetrix, SAGE Affymetrix Affymetrix, dual-color cDNA/oligo Normalization to LOWESS, total intensity, median ratio or to a user generated gene list, graphing data trends after normalization enabling examination of data variability.
global mean or median ratio based normalization, Lowess, MDS module standard data processing and clustering GeneDirector dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo, Affymetrix ImaGene and GeneSight packagse GeNet dual-color cDNA/oligo, Affymetrix filters, dual-color cDNA/oligo, Affymetrix, dual-color cDNA/oligo, Affymetrix filters, dual-color cDNA/oligo, Affymetrix, dual-color cDNA/oligo, Affymetrix GeneSpring package filters, dual-color cDNA/oligo, Affymetrix, GeneX dual-color cDNA/oligo, Affymetrix dual-color cDNA/oligo dual-color cDNA/oligo, Affymetrix Global normalization, z-score, Lowess normalization, full and sub-grid, for Affymetrix, alternative probe based protocol R routines are available to manipulate the data (normalization, clustering, etc.) maxdSQL dual-color cDNA/oligo, Affymetrix dual-color cDNA/oligo, Affymetrix Filtering based on numerical values. 2-D correlation plot with overlay of cluster data, multidimensional plots. NOMAD dual-color cDNA/oligo, Axon scanner outcome dual-color cDNA/oligo, Axon scanner outcome dual-color cDNA/oligo, Affymetrix, maxdView, expression data class which represents results from one or more hybridizations and any associated clusters of genes. Profiles viewers. dual-color cDNA/oligo, Axon scanner outcome PartisanarrayLIMS filters, dual-color cDNA/oligo, Affymetrix, Affymetrix, Nylon filters, dual-color cDNA/oligo filters, dual-color cDNA/oligo, Affymetrix, Affymetrix, Nylon filters, dual-color cDNA/oligo filters, dual-color cDNA/oligo, Affymetrix, global mean or median ratio based normalization Affymetrix, Nylon filters. Table Viewer: K-means, Kmedians clustering, and SOM algorithms. dual-color cDNA/oligo dual-color cDNA/oligo dual-color cDNA/oligo Error models with any experimental replicates performed, P-values computed and error bars for every gene expression measurement, ANOVA. 
ScanAlyse package: global normalization GeneTraffic(Multi) Resolver SMD Data normalization protocols and data analyses modules global normalization, normalization on control spots, spike controls, or subset of spots. Hierarchical clustering, K-means, PCA, SOM. global mean or median ratio based normalization ScanAlyse package: global normalization

Microarray and Data Repositories
GEO
  Data type: Microarray/SAGE
  Tissue type: Normal and tumor
  Description: Gene expression and hybridization array data repository
  Web address: http://www.ncbi.nlm.nih.gov/geo/
RAD
  Data type: Microarray/SAGE
  Tissue type: Normal and tumor
  Description: The ultimate goal is to allow comparative analysis of experiments performed by different laboratories using different platforms and investigating different biological systems.
  Web address: http://www.cbil.upenn.edu/RAD2/
ExpressDB
  Data type: Microarray/SAGE
  Tissue type: Yeast
  Description: Collection of yeast RNA expression datasets
  Web address: http://arep.med.harvard.edu/cgi-bin/ExpressDByeast/EXDStart
CleanEx
  Data type: Microarray/EST libraries
  Tissue type: Normal and tumor
  Description: Gene expression and hybridization array data repository. SAGE will be added.
  Web address: http://www.epd.isb-sib.ch/cleanex/
Gene Expression Database
  Data type: Microarray
  Tissue type: Tumor
  Description: Data from 60 cancer cell lines based on Affymetrix and cDNA technology
  Web address: http://discover.nci.nih.gov/arraytools
SMD
  Data type: Microarray
  Tissue type: Normal and tumor
  Description: Extensive collection of cDNA microarray data
  Web address: http://genome-www.stanford.edu/microarray
SAGEmap
  Data type: SAGE
  Tissue type: Normal and tumor
  Description: Data from one hundred SAGE (Serial Analysis of Gene Expression) CGAP (Cancer Genome Anatomy Project) libraries
  Web address: http://www.ncbi.nlm.nih.gov/SAGE/
SAGE
  Data type: SAGE
  Tissue type: Normal and tumor
  Description: SAGE data from over 600,000 transcripts, including SAGE data from human, mouse and yeast transcripts.
  Web address: http://www.sagenet.org/SAGEData/sagedata.htm
UniGene
  Data type: EST libraries
  Tissue type: Normal and tumor
  Description: Collection of EST libraries from different species
  Web address: http://www.ncbi.nlm.nih.gov/UniGene/
CGAP/Tissue
  Data type: EST libraries
  Tissue type: Normal and tumor
  Description: Information on CGAP and other cDNA libraries.
  Web address: http://cgap.nci.nih.gov/Tissues/xProfiler
BodyMap
  Data type: EST libraries
  Tissue type: Normal and tumor
  Description: Database of expression information of human and mouse genes in various tissues and cell types.
  Web address: http://bodymap.ims.u-tokyo.ac.jp
TissueInfo
  Data type: EST libraries
  Tissue type: Normal
  Description: Information on the tissue expression profile of a sequence, obtained by comparing the given sequence against the EST database. Each EST comes from a library derived from a specific tissue type.
  Web address: http://icb.mssm.edu/services/tissueinfo/query

Web Resources: General Information
• Leung's: Links page & software info
• Davison's: DNA Microarray Methodology - Flash Animation
• gene-chips: Overview of the technique, papers…
• Chips & microassays: General information
• SMD guide: Stanford's links page, very complete
• Introduction: Online introduction to microarrays (EBI)
• Brown Lab Guide: Microarray protocols and arrayer construction

Web Resources: Data Analysis Tools
• Expression Profiler: Online clustering and analysis tools (EBI)
• GenEx: Database, repository and analysis tools (NCGR)
• MAExplorer: MicroArray Explorer for data mining of gene expression, free download
• ArrayDB: Downloadable tools, short online demo
• MAXD: Downloadable data warehouse and visualisation for expression data
• Jexpress: Java tools for gene expression data analysis, free download
• Eisen Lab: Michael Eisen's suite for image quantitation and data analysis (Scanalyze, Cluster, TreeView). Downloadable.

Web Resources: Public Databases I
• SMD: The Stanford Microarray Database
• Chip DB: Searchable database on gene expression (MIT)
• ExpressDB: Public queries of E.
coli and yeast data
• GEO: Gene expression data repository and online resource (NCBI)
• RAD: RNA Abundance Database
• Expression Connection: Saccharomyces Genome Database expression data retrieval
• EpoDB: Expression information retrieval for one gene at a time
• yMGV: Public queries of yeast data

Web Resources: Public Databases II
• AMAD: Downloadable web-driven database system
• ArrayExpress: Public data deposition and public queries (EBI)
• maxdSQL: Downloadable data warehouse and visualization environment
• GXD: Mouse expression data storage and integration
• GeNet: Distribution and visualization of gene expression data from any organism

Web Resources: Public Databases III
• Drosophila microarray project: Drosophila Metamorphosis Time Course Database
• Samson Lab: Yeast Transcriptional Profiling Experiments
• SageMap: NCBI SAGE data and analysis tools
• NCI60 cancer project: Supplement to Ross et al. (Nat Genet., 2000)
• Serum-response: Supplement to Iyer et al. (1999) Science 283:83-87
• Breast cancer: Supplement to Perou et al. Nature 406:747-752 (2000)
• Cancer Molecular Pharmacology: Integration of large databases on gene expression and molecular pharmacology

References: Interesting Books
• Kohane et al., Microarrays for an Integrative Genomics, MIT Press, 2003
• Baldi and Hatfield, DNA Microarrays and Gene Expression, Cambridge University Press, 2002
• Jagota, Microarray Data Analysis and Visualization, Bay Press, 2001