Curation practices at the EMAGE gene expression database
Jeff Christiansen PhD
EMAGE Senior Curator
MRC Human Genetics Unit, Edinburgh

Overview
What is EMAGE?
- Data types
- Data sources
- Annotation methods and searching methods
Curation Aspects
- Genes, sequences, text descriptions and links
- General biocuration efforts
- Access to images (data access)
- Towards standardised experimental reporting
- Maintaining and developing the framework housing EMAGE data
- Data Preservation

Gene Expression
Every cell in the body contains copies of all genes (~25,000) in the nucleus. However, different cell types 'express' different sets of these genes.
Gene expression = DNA → mRNA → protein
Detection of a specific mRNA or protein is performed to profile the expression of a gene.

Gene Expression profiling
Gene expression profiling can be done:
- en masse (in a dissociated tissue)
- individually (in an intact tissue, i.e. in situ)

EMAGE: Data Types
EMAGE holds in situ expression data (mRNA and protein) in mouse embryos
- Whole embryo photos
- Photos of sections
- 3D images

EMAGE: Data Sources
Literature
- 155 main journals contain data concerning mouse embryos
- 36 Mouse Genome Informatics (USA) curators manually read these journals and create a basic index of papers containing relevant data (i.e. since 1993, 14,375 papers containing 59,906 images showing in situ expression results for 10,052 genes)
- 4 full-time MGI/GXD staff fully curate a proportion of these papers (~25%)

EMAGE: Data Sources
Large scale screening projects
- GUDMAP - mouse urogenital system: ~34,000 images - curated data (5 year project)
- GenSat - mouse embryo E12.5: ~5,000 images - non-curated
- FaceBase - mouse craniofacial: ~4,000 2D + ~4,000 3D images - curated data (2 year pilot project)
- EURExpressII - mouse embryo E14.5: ~500,000 images - curated data (4 year project)

EMAGE: Data Sources
Data submissions from individual labs
- ~2,500 images from numerous labs, all stages of development

EMAGE: Data Annotation
- Source - journal/screen/direct submission, submitter contact details
- Detection reagent - defining the reagent that was used to detect expression
- Experimental Conditions - defining the full experimental conditions used
- Links - addition of specific relevant links to data in other databases

EMAGE: Data Annotation
Sites of gene expression - annotation to EMAP embryo Atlas
Text annotation (to anatomy ontology)
Detected in:
- central nervous system: ganglion: cranial: acoustic ganglion VIII
- central nervous system: ganglion: cranial: facial ganglion VII
- central nervous system: ganglion: cranial: glossopharyngeal IX
- central nervous system: ganglion: cranial: trigeminal V
- central nervous system: ganglion: cranial: vagus X
- peripheral nervous system: spinal: ganglion: dorsal root ganglion
Text annotation is based on the author/submitter description.
The challenge is to accurately reflect the meaning of the author description within the constraints of the ontology.
This process often highlights shortcomings of the ontology.

EMAGE: Data Annotation
Sites of gene expression - annotation to EMAP embryo Atlas
Spatial annotation (to virtual embryo model)
Levels shown: strong, moderate, not detected

EMAGE: Data Interrogation
Text based: What gene expression is detected in the 1st branchial arch from TS14-TS14?

EMAGE: Data Interrogation
Spatial based

EMAGE: Data Interrogation
Spatial based data mining

EMAGE: Data Curation
Biocuration staff from 106 databases…
- 1st International Biocurator Meeting, December 2006, Monterrey, USA
- 2nd International Biocurator Meeting, October 2007, San Jose, USA
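The text-based query from the Data Interrogation slide ("What gene expression is detected in the 1st branchial arch from TS14-TS14?") can be illustrated with a minimal sketch. It assumes each annotation stores its anatomy ontology path, so that a query for a structure also returns annotations made to terms beneath it. The `Annotation` class, the `genes_detected_in` function and the gene names are hypothetical and are not the EMAGE implementation; matching on term names is a simplification, since a real ontology would use term identifiers.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Annotation:
    gene: str                # gene symbol (hypothetical values below)
    stage: int               # Theiler stage number, e.g. 14 for TS14
    path: Tuple[str, ...]    # anatomy ontology path, most general term first
    strength: str            # "strong", "moderate" or "not detected"


def genes_detected_in(annotations: Iterable[Annotation], structure: str,
                      first_stage: int, last_stage: int) -> List[str]:
    """Return genes detected in `structure`, or in any term beneath it,
    between first_stage and last_stage (inclusive)."""
    hits = set()
    for a in annotations:
        in_stage = first_stage <= a.stage <= last_stage
        in_structure = structure in a.path  # matches the term or a descendant
        if in_stage and in_structure and a.strength != "not detected":
            hits.add(a.gene)
    return sorted(hits)


# Hypothetical records; the paths reuse terms from the text annotation slide.
EXAMPLE = [
    Annotation("GeneA", 14, ("central nervous system", "ganglion", "cranial", "trigeminal V"), "strong"),
    Annotation("GeneB", 14, ("central nervous system", "ganglion", "cranial", "vagus X"), "not detected"),
    Annotation("GeneC", 16, ("central nervous system", "ganglion", "cranial", "facial ganglion VII"), "moderate"),
    Annotation("GeneD", 14, ("peripheral nervous system", "spinal", "ganglion", "dorsal root ganglion"), "moderate"),
]

print(genes_detected_in(EXAMPLE, "cranial", 14, 14))
```

Querying a higher-level term such as "cranial" picks up annotations made to any individual cranial ganglion, which is the main benefit of annotating to an ontology rather than to free text.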
EMAGE: Data Curation
International Society for Biological Curation (ISBC)
- discussion begun at the 2nd International Biocuration Meeting
- currently being formed
- for professional Biocurators and those who develop Biological Curation Tools and Databases
- to provide a forum for interactions between Biocurators
- to present a unified voice for the Biological Curation effort
- to facilitate better communication and interactions between Biocurators and both Researchers and Journals

EMAGE: Data Curation
Author Description vs EMAGE Database Description

Objective
Gene Assayed
  Author description: Gsh2
  EMAGE database description: MGI:94843, current symbol Gsx2 (aka Gsh2, Gsh-2)
Stage
  Author description: E9.5
  EMAGE database description: TS16
Probe
  Author description: "as used by Hsieh-Li et al"
  EMAGE database description: MGI:2447759 (nt 1038-1498 of S79041.1)
  Hsieh-Li: "a 460-bp BamHI-SmaI fragment of the Gsh-2 cDNA, which does not contain homeobox sequences".
    1 ctgcctcggc taacgaagac aaggagattt cccccttgta aaggcagagg ctccttctgc
   61 tacctgctcc cggtaccctg ccctcctcct ccccatcagc acagggacca aagttctagt
  121 ccattgccct cattccacct ggaaaagaaa ctctgaaaag tccggggaat tcaatgccgg
  181 gcccgacttt gaagctagct cctctttatc tgggattcca ctcagttacg gattggtttt
  241 ttgtttgctt ttttgttgtt tttaatgtaa atatctagaa ttctaaccag tctcatatat
  301 caggaaaaac cagggttgat taaagtttaa cactgtatgg ggggaggggt tggttgaaga
  361 cgcatgccat ttgcccccct gtcttttcag aacttgatga gaagaggggt ttctttattg
  421 tttttgttgt tgttttaaaa tgaaatcatt gaagttgcca t

'Subjective'
Detected in
  Author description: Ventro-lateral forebrain
  EMAGE database description: Telencephalon (EMAP:1705)
  Pattern: Restricted
  Note: Expression restricted to ventro-lateral telencephalon
Screen vs non-screen
Curator Confidence of:
- Pattern clarity
- Morphology match between data and model

Quality Assurance always performed by a second curator

EMAGE: Data Curation
This standardised, stable and accessible description places the data in a wider biological context for data analysis:

EMAGE: Data Curation
e.g. GENE IDENTIFIER: MGI:94843

EMAGE: Data Curation
e.g. PROBE SEQUENCE:
- Nkx2-1 0.531kb probe
- Nkx2-1 0.702kb probe
Why the difference?

EMAGE: Data Curation
e.g.
PROBE SEQUENCE: Nkx2-1 0.531kb probe; Nkx2-1 0.702kb probe

EMAGE: Data Curation
Automation
- Sequence-based bioinformatics
- Text mining
- Automated image analysis
  - Raw data
  - Signal extraction
  - Registration to atlas model

Access to raw data
OPEN ACCESS - True open access allows authors to retain ownership of the copyright for their article, but authors allow anyone to download, reuse, reprint, modify, distribute, and/or copy articles, so long as the original authors and source are cited. No permission is required from the authors or the publishers.
Of 155 journals containing mouse gene expression data, only BMC, PLoS and the Biochemical Journal are truly open access, with a CC attribution (BMC, PLoS) or CC attribution non-commercial (Biochem J) licence, and xml data access.
All others still require separate agreements between the publishers and an individual for any use apart from viewing individual papers online and printing copies during the "open access" period.
This considerably inhibits access to the data for computational purposes (e.g. text mining, re-use of images).

Access to raw data
Open Access
Fig7B. Copyright: This image is from [doi:10.1186/1471-213X-6-56] Andrieu D; Meziane H; Marly F; Angelats C; Fernandez PA; Muscatelli F, BMC Dev Biol 2006:8, an open-access article, licensee BioMed Central Ltd. [PMID:17116257]

Access to raw data
[Chart: number of suitable images per journal, open access vs non open access]
Copyright agreement negotiated with Publishers (16 journals)

Access to raw data
Copyright agreement negotiated with publisher
Figure 5E. Reprinted with permission from Elsevier from [doi:10.1016/j.ydbio.2006.02.002] Dev Biol 2006 May 15;293(2):370-81, Andersson O; Reissmann E; Jornvall H; Ibanez CF, "Synergistic interaction between Gdf1 and Nodal during anterior axis development." Copyright 2006 [PMID:7556909].

Access to raw data
[Chart: number of suitable images per journal]
What about the rest?

Access to raw data
Many offer permissions requests via Copyright Licensing Agency (Europe), Copyright Clearance Center (USA) or an equivalent
- Individual permissions: ~£10 per image
- Annual Global permissions: intended for organisation level (i.e. MRC), ~££££££ even for non-profit organisations

Access to raw data
MRC - no policy regarding electronic reproduction
Stéphane Goldstein - Research Information Network (London)
- images should be treated as data
- issues guidance to encourage good practice for researchers, funding agencies, publishers, government
http://www.rin.ac.uk/data-principles
Leads from Stéphane to take it further:
- David Shotton - Reader in Image Bioinformatics, Oxford
- Scientific Technical & Medical Publishers Association

Access to raw data
No agreement for reproduction
Figure 2C. [doi:10.1007/s00441-005-0036-9] Amrein L; Barraud P; Daniel JY; Perel Y; Landry M, "Expression patterns of nm23 genes during mouse organogenesis." Cell Tissue Res 2005 Dec;322(3):365-78. [PMID:16082520].

Proposed Data Standards
In situ gene expression experiments are generally reported in too little detail for the experiment to be repeated.
Inspired by MIAME - Minimum Information About a Microarray Experiment

Proposed Data Standards
- MIACA: Minimum Information About a Cellular Assay, and the Cellular Assay Object Model
- MIGS: Minimum Information about a Genome Sequence
- MIAPE: Minimum Information About a Proteomics Experiment (Mass Spec, Gel Electrophoresis)
- MIAPA: Minimum Information about a Phylogenetic Analysis
- MIARE: Minimum Information about an RNAi experiment
- MI-FACE: Minimum Information about a Fluorescence Activated Cell Experiment
- PSI-MOD: a community standard for representation of Protein Modification Data
- MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
- MIMIx: Minimum Information required for reporting a Molecular Interaction Experiment
The aim is to have these adopted by the community and made a requirement for published data.
Creators of different standards must communicate with each other for standardisation!
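As an illustration of how a minimum-information standard could be enforced at submission time, the sketch below checks a record against a checklist. The field names are hypothetical: they are taken from the annotation categories listed earlier in this talk (source, detection reagent, experimental conditions, sites of expression) and are not the actual field list of MISFISHIE or any other standard named above.

```python
from typing import Dict, List

# Hypothetical checklist based on the annotation categories named earlier in
# this talk; it is NOT the actual MISFISHIE field list.
REQUIRED_FIELDS = (
    "source",
    "detection_reagent",
    "experimental_conditions",
    "sites_of_expression",
)


def missing_fields(record: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or blank in `record`."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]


# Hypothetical submission with one blank and one absent field.
submission = {
    "source": "direct submission",
    "detection_reagent": "as used by Hsieh-Li et al",
    "experimental_conditions": "",
}

print(missing_fields(submission))
```

A check of this kind only confirms that a field is present. Whether the content is sufficient for the experiment to be repeated still needs a curator's judgement.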
Proposed Data Standards
MISFISHIE - Minimum Information Specification For In Situ Hybridisation and Immunohistochemistry Experiments

EMAGE: other curation tasks
Scan images from paper copies (pdf) and make these available via EMAGE

Maintaining the EMAGE data framework
- 17 3D models
- ~26,000 terms over 26 stages

Maintaining the EMAGE data framework
EMAP ontology management
- ~26,000 terms over 26 stages
- used by EMAGE, GXD, GUDMAP and others
- available through OBO-Foundry (hosted at Berkeley)
- The EMAP ontology can be edited using OBO-edit
- Distributed editing via OBO-edit
The challenge is to develop the ontology whilst maintaining the integrity of annotations already made using older versions. The curation processes required to safeguard this are currently being devised…
Checkout/check-in process with changes moderated via a group of experts

EMAGE: data preservation
EMAP ontology management
EMAGE data back-up / archiving
- Data = xml + images (.jpeg .gif .mov .mpeg .wlz)
- Working versions + live database + 'backup' copies
- Daily back-ups to network and tape (on-site and off-site)
- Short term protection
- No archive system currently in place - required!
- MRC policy - scientific data should be "kept" for 10 years, clinical data for 20 years.
- Al Brown - MRC were to set up an archiving/preservation centre - not to be.
- MRC currently aim to use a Data Support Service (one bidder is EU/EDINA/DCC) - outcome unknown as yet.
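The ontology management challenge described above, keeping existing annotations valid while the ontology changes, could be supported by a check run before a new ontology version is released. The sketch below is only an assumption about what such a check might look like: the function name, the entry IDs and all term IDs other than EMAP:1705 are hypothetical.

```python
from typing import Dict, List, Set


def orphaned_annotations(annotations: Dict[str, str],
                         new_term_ids: Set[str]) -> List[str]:
    """Return IDs of annotations whose ontology term is absent from the new version."""
    return sorted(a for a, term in annotations.items() if term not in new_term_ids)


# Hypothetical data: annotation ID -> term ID used when the annotation was made.
annotations = {
    "entry-1": "EMAP:1705",
    "entry-2": "EMAP:0001",
    "entry-3": "EMAP:0002",
}
# Term IDs present in the proposed new ontology version (hypothetical).
new_version_terms = {"EMAP:1705", "EMAP:0002"}

print(orphaned_annotations(annotations, new_version_terms))
```

Any annotation flagged this way would need a curator to decide whether the term was renamed, merged or removed before the change is checked in.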
Acknowledgements
Shanmugasundaram Venkataraman
Lorna Richardson
Malcolm Fisher
Jackie Finger, Terry Hayamizu, Connie Smith, Ingeborg McCright
Martin Ringwald
Peter Stevenson
Nick Burton
Yiya Yang
Jiangao Rao
Attilla Gyenesei
Duncan Davidson
Richard Baldock