Introduction and Applications of Microarray Databases
Chen-hsiung Chan
Department of Computer Science and Information Engineering
National Taiwan University

MIAME (Minimum Information About a Microarray Experiment)
- MIAME describes the minimum information about a microarray experiment that is needed to interpret the results of the experiment unambiguously and, potentially, to reproduce the experiment. [Brazma et al., Nature Genetics]

MIAME
- Raw data (CEL or GPR files)
- Final processed (normalized) data
- Essential sample annotation, including experimental factors and their values
- Experimental design, including sample data relationships
- Sufficient annotation of the array
- Essential laboratory and data processing protocols

Databases using MIAME
- ArrayExpress at EBI
- GEO at NCBI
- CIBEX at DDBJ

ArrayExpress
- http://www.ebi.ac.uk/microarray-as/aer/
- Stores transcriptomics and related data
- Data warehouse stores gene-indexed expression profiles
- In accordance with MGED recommendations: MIAME

ArrayExpress statistics
- Experiment repository: 2,914 experiments (each with at least 6 microarrays) and growing
- Expression profiles: including 267 experiments, 121,891 genes
- Data warehouse updated every day

Searching ArrayExpress
- Keywords: breast cancer, cell cycle, etc.
- Accession numbers: E-XXXX-d, e.g. E-AFFY-1281, E-TIGR-372, etc.
- Secondary accession numbers: GEO accession, e.g. GSE5389
- Species names: mainly Latin names (e.g. Homo sapiens); common names may be used as well (e.g. human)
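
The search term types listed above can be told apart with a couple of regular expressions. The following is a minimal sketch, not ArrayExpress's own logic: the function name `classify_query` is hypothetical, and the patterns assume that E-XXXX-d means "E-", four letters, "-", then digits, and that a GEO secondary accession is "GSE" followed by digits, as in the examples given.

```python
import re

# Assumed reading of the E-XXXX-d form: "E-", a four-letter code, "-", digits.
ARRAYEXPRESS_RE = re.compile(r"^E-[A-Z]{4}-\d+$")
# Assumed form of a GEO Series accession used as a secondary accession.
GEO_SERIES_RE = re.compile(r"^GSE\d+$")


def classify_query(term):
    """Guess what kind of search term the user typed (hypothetical helper)."""
    candidate = term.strip().upper()
    if ARRAYEXPRESS_RE.match(candidate):
        return "arrayexpress_accession"
    if GEO_SERIES_RE.match(candidate):
        return "geo_series_accession"
    # Anything else is treated as a free-text keyword or species name.
    return "keyword"


if __name__ == "__main__":
    for term in ["E-AFFY-1281", "E-TIGR-372", "GSE5389", "breast cancer"]:
        print(term, "->", classify_query(term))
```

Anything that does not look like an accession falls through to a keyword search, which also covers species names such as Homo sapiens.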
ArrayExpress interface

ArrayExpress Search/Browse Result
- Keyword: lung cancer

ArrayExpress Search/Browse Result
- Detailed view

Expression Profile results
- Thumbnail view
- BigPlot view
- Gene ranking (the most differentially expressed experiments are ranked at the top)
- Similarity search: search for genes with similar expression levels

Take a break…

Gene Expression Omnibus (GEO)
- http://www.ncbi.nlm.nih.gov/geo/
- Gene expression/molecular abundance repository
- MIAME compliant
- Supports browsing, query, and retrieval

GEO record types
- Platform
- Sample
- Series
- DataSet
- Profile

GEO Platform
- A Platform record defines the list of elements that may be detected and quantified in that experiment (e.g., cDNAs, oligonucleotide probesets)
- Each Platform record is assigned a unique and stable GEO accession number (GPLxxx)
- A Platform may reference many Samples that have been submitted by multiple submitters

GEO Sample
- A Sample record describes the conditions under which an individual Sample was handled, the manipulations it underwent, and the abundance measurement of each element derived from it
- Each Sample record is assigned a unique and stable GEO accession number (GSMxxx)
- A Sample entity must reference only one Platform and may be included in multiple Series

GEO Series
- A Series record links together a group of related Samples and provides a focal point and description of the whole study
- Series records may also contain tables describing extracted data, summary conclusions, or analyses
- Each Series record is assigned a unique and stable GEO accession number (GSExxx)

GEO DataSet
- Assembled at NCBI
- Samples are all equivalently measured and normalized
- Can be viewed and analyzed with NCBI’s advanced data display and analysis tools

GEO Profile
- A Profile consists of the expression measurements for an individual gene across all Samples in a DataSet
- Profiles can be searched using Entrez GEO Profiles
- Similar to the Expression Profile in ArrayExpress

SOFT (Simple Omnibus Format in Text)
- Text based
- Line based
- Easily parsed with text processing languages, including Perl, Python, Ruby, PHP, etc.

Take a break…

Network Biology Visualization and Analysis

Cytoscape
- Open source network visualization and analysis software
- ‘Core’ features include network layout and query, and integration of visualizations with state data
- Can be extended by plugins

Cytoscape developers
- University of California at San Diego (Trey Ideker)
- Institute for Systems Biology (Leroy Hood)
- Memorial Sloan-Kettering Cancer Center (Chris Sander)
- Institut Pasteur (Benno Schwikowski)
- Agilent Technologies (Annette Adler)
- University of California at San Francisco (Bruce Conklin)

Cytoscape
- A Java application
- Requires Java 5 or 6 (JDK5/6 or JRE5/6)

Simple Interaction Format (SIF)
- Each line denotes one interaction
- InteractorA xx InteractorB
- ‘xx’ is the interaction type:
  - pp: protein-protein interaction
  - pd: protein-DNA interaction (transcription factor/regulation)
  - pr (protein-reaction), rc (reaction-compound), cr (compound-reaction), gl (genetic-lethal), pm (protein-metabolite), mp (metabolite-protein)

Other interaction formats supported
- GML
- XGMML
- SBML
- BioPAX
- PSI-MI
- Tab-delimited text table and Excel

Cytoscape Demonstration

Applications of Gene Expression
- Gene selection (differentially expressed genes)
- State annotation in networks (expression level)
- Gene regulatory network identification
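
The Simple Interaction Format described above is simple enough to read without Cytoscape. The following is a minimal sketch rather than Cytoscape's own parser: the function names `parse_sif` and `node_degrees` are hypothetical, fields are assumed to be whitespace-separated, and accepting more than one target after the interaction type is an assumption that goes beyond the one-interaction-per-line description. The gene names in the sample are made up.

```python
from collections import defaultdict


def parse_sif(text):
    """Return a list of (source, interaction_type, target) tuples.

    Assumes whitespace-separated fields. Lines with fewer than three
    fields (blank lines, isolated nodes) are skipped.
    """
    edges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        source, interaction_type, *targets = fields
        # Assumption: several targets on one line mean one edge per target.
        for target in targets:
            edges.append((source, interaction_type, target))
    return edges


def node_degrees(edges):
    """Count how many interactions each node takes part in."""
    degrees = defaultdict(int)
    for source, _interaction_type, target in edges:
        degrees[source] += 1
        degrees[target] += 1
    return dict(degrees)


if __name__ == "__main__":
    # Made-up sample network: one pp edge and two pd edges.
    sample = "geneA pp geneB\ngeneA pd geneC geneD\n"
    edges = parse_sif(sample)
    print(edges)
    print(node_degrees(edges))
```

An edge list in this form is a convenient starting point for the applications listed above, for example attaching an expression level to each node before loading the network into a viewer.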