Timeline and procedures for datasets
BCBC Pre-retreat Workshop
Tyson’s Corner, VA
May 11, 2011

Topics to cover
• Timeline for a dataset from contact to web site
• Policies to follow and documents to use
• Ten questions about your dataset
• Creating a MAGE-TAB document with us
• Seeing your dataset on the Beta Cell web site
• A tool you can use for MAGE-TAB: Annotare

Datasets to contact us about
• Your deliverables
  – Microarray experiments
  – High-throughput sequencing experiments (RNA-seq, ChIP-seq, FAIRE-seq, etc.)
  – RT-PCR screens
  – Other deliverables: we can discuss how to integrate them
• Other key datasets
  – From your lab but from different funding
  – From the literature

Steps to get a study into Beta Cell
• Contact us. Let us know what is coming and when, so we can schedule working with you.
• Fill out the Ten Questions. When we get this from you, we can generate an initial spreadsheet (MAGE-TAB) for you to complete.
• Fill out the highlighted areas of the MAGE-TAB. We will go back and forth with you on details to get it right.
• Send us your data. We will set up an FTP account for you. Send us the raw data (e.g., Affymetrix CEL files, FASTQ sequence reads) and the processed data that the conclusions are based upon.
• Set a release schedule. We will load the dataset and incorporate it into queries and web pages as appropriate. We need to agree on when to release it to the BCBC and to the general public.
  – We can also submit your data to ArrayExpress or, if desired, GEO.
• View/query your dataset. Beta Cell has releases every 3 to 4 months.
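Before sending raw sequence reads, a quick local check can catch truncated or malformed files. The following is a minimal sketch in Python, assuming uncompressed FASTQ with exactly four lines per read; the function name `iter_fastq` is hypothetical and is not part of any BCBC or Beta Cell tooling.

```python
from typing import Iterable, Iterator, Tuple


def iter_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records.

    Assumption: each read occupies exactly four lines (no wrapped sequences).
    Raises ValueError on the first malformed or truncated record.
    """
    stream = (line.rstrip("\r\n") for line in lines)
    record_number = 0
    while True:
        header = next(stream, None)
        if header is None:
            return  # clean end of input
        if header == "":
            continue  # tolerate blank lines between records
        record_number += 1
        seq = next(stream, None)
        plus = next(stream, None)
        qual = next(stream, None)
        if qual is None:
            raise ValueError(f"record {record_number}: file ends mid-record")
        if not header.startswith("@"):
            raise ValueError(f"record {record_number}: header must start with '@'")
        if not plus.startswith("+"):
            raise ValueError(f"record {record_number}: third line must start with '+'")
        if len(seq) != len(qual):
            raise ValueError(
                f"record {record_number}: sequence and quality lengths differ"
            )
        yield header[1:], seq, qual


# Example with one made-up read; a real file would be passed as an open handle.
example = ["@read1\n", "ACGTN\n", "+\n", "IIII#\n"]
for read_id, seq, qual in iter_fastq(example):
    print(read_id, len(seq))
```

For a real submission the same function could be applied to `open("reads.fastq")` (a hypothetical file name), since a text file handle iterates line by line.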
Timeline
• Completion of MAGE-TAB:
  – Requires back and forth between the CC and the contact person in the investigator’s lab
  – Time to completion depends on the responsiveness of that contact person
  – Until the MAGE-TAB is completed, data loading cannot occur
• Data loading:
  – Once the MAGE-TAB is completed and all necessary files have been delivered, time to load the data depends on the size of your study
  – For a typical study, data loading takes a few weeks
  – Missing files will delay the process
• Keep in mind that when you contact us to submit a study, you will be put in a queue, and the process of getting your study into Beta Cell Genomics will start once you reach the top of the queue
• Studies that are meant to be viewable on the BCBC website (either by the general public or by BCBC investigators only) have priority over private studies; a study that is to be kept private will be placed lower in the queue

Policies to follow and documents to use
• Ten Questions about your dataset
  – Available as a BCBC miscellaneous resource
  – http://www.betacell.org/resources/data/miscellaneous/
• Bioinformatics/Epigenomics Working Group
  – RNA-seq and ChIP-seq recommendations
    • Includes checklists for data and information to provide
  – Mike Snyder will provide an overview and discuss

Meeting Deliverables
• For a study to be considered fully “delivered”, the following is required on the investigator’s part:
  – Provide answers to the initial 10 questions and all necessary data files
  – Respond to all inquiries needed to generate an accurate MAGE-TAB
  – Allow your study to be visible (at least by other BCBC investigators) on the Beta Cell website

MGED Standards
• What information is needed for a microarray
experiment?
  – MIAME: Minimal Information About a Microarray Experiment. Brazma et al., Nature Genetics 2001
• How do you “code up” microarray data?
  – MAGE-OM: MicroArray Gene Expression Object Model. Spellman et al., Genome Biology 2002
  – MAGE-TAB. Rayner et al., BMC Bioinformatics 2006
• What words do you use to describe a microarray experiment?
  – MO: MGED Ontology. Whetzel et al., Bioinformatics 2006

MIAME in a nutshell (à la Alvis Brazma)
[Figure: Experiment and Array design; Sample, RNA extract, labelled nucleic acid, hybridisation, and array, each with an associated Protocol; Microarray; Gene expression data matrix (genes); normalization; integration. Source: Stoeckert et al., Drug Discovery Today: TARGETS 2004]

Sequencing is replacing array technology
[Figure: the same diagram, with FASTQ sequence reads shown alongside the hybridisation and array steps. Example read from the slide:]
@HWI-EAS266_0011:8:1:6:969#0/1
GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT
+HWI-EAS266_0011:8:1:6:969#0/1
_abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY

Sequencing is replacing array technology (continued)
[Figure: the same diagram extended beyond RNA extracts to chromatin and DNA extracts; the outputs are Microarray, ChIP-Seq, MeDIP-Seq, etc., followed by normalization and integration]

From MGED to FGED
• What information is needed for an HTS experiment?
  – MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
• How do you “code up” functional genomics data?
  – MAGE-TAB can still be utilized
• What words do you use to describe a functional genomics experiment?
  – OBI: Ontology for Biomedical Investigations, incorporates MO

MAGE-TAB Format

What is MAGE-TAB?
• A simple spreadsheet view consisting of 2 files:
  – IDF: describes the experiment design, contact details, variables, and protocols
  – SDRF: a spreadsheet with columns that describe samples, annotations, protocol references, assays, and data
• Linked data files (e.g. CEL files) are referenced by the SDRF

Where can I get MAGE-TAB from?
• ~10,000 MAGE-TAB files are available from ArrayExpress (includes GEO-derived and ArrayExpress data)
• caArray also provides MAGE-TAB files for download

Who is using MAGE-TAB?
• BioConductor
• GenePattern
• MeV
• and Beta Cell Genomics!

IDF file for E-TABM-34
IDF = Investigation Description Format
[Screenshot: IDF file]

SDRF file for E-TABM-34
SDRF = Sample and Data Relationship Format
[Screenshot: SDRF file]

A microarray expression study
• IDF
• Experimental Design
  [Figure: following 1 sample. Bench component: OrganismPart; black border = biomaterials, red border = treatments. In-silico component: image acquisition, feature extraction, summarization (feature extraction II) and quantile normalization]
• SDRF
  – Let’s focus on the highlighted row
• From design to MAGE-TAB
• Viewing the Annotation
• Querying the Annotation

Loading and Analyzing the Data
• Image and .CEL files are archived and their location stored in the database
• Raw and processed data loaded into the database
• Downstream analyses (e.g.
differential expression) are performed, generating gene lists
• Analysis results loaded into the database

Querying the Data

A ChIP-Seq study
• IDF
• Experimental Design
  – Bench Component
  – In-silico Component: cluster generation, image acquisition, sequencing, alignment, peak calling

  Sample      Sequence file       ELAND file
  Ptf1a_s5    Ptf1a_s5_seq.txt    s5_eland.txt
  Ptf1a_s4    Ptf1a_s4_seq.txt    s4_eland.txt
  Input_s8    Input_s8_seq.txt    s8_eland.txt
  Rbpjl_s6    Rbpjl_s6_seq.txt    s6_eland.txt
  Input_s2    Input_s2_seq.txt    s2_eland.txt
  Rbpjl_s4    Rbpjl_s4_seq.txt    s4_eland.txt

  Peak-calling outputs: Ptf1a_peaks, Rbpjl_peaks
• SDRF
• Viewing the Annotation
• Querying the Annotation
• Viewing the Data
• Querying the Data

Annotare - An open source standalone MAGE-TAB editor
Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M, Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA. Annotare - a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics. 2010 Aug 23.

Annotare - an open source MAGE-TAB Editor
Annotare is an annotation tool for high-throughput gene expression experiments in MAGE-TAB format. Researchers can describe their investigations with the investigators’ contact details, experimental design, protocols that were employed, references to publications, details of biological samples, arrays, and experimental data produced in the investigation.
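The investigation-level details listed above (contacts, design, protocols, publications) are what the IDF file holds. As a rough illustration of how simple that file is to read, here is a minimal Python sketch, assuming the usual IDF layout of one tab-delimited row per field with the label in the first column. The function name `read_idf` and all example values are hypothetical; this is not how Annotare or Beta Cell Genomics parses IDF files.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_idf(handle: TextIO) -> Dict[str, List[str]]:
    """Map each IDF row label to its values, keeping column positions.

    Assumptions: tab-delimited text, label in the first column, and
    lines starting with '#' treated as comments.
    """
    fields: Dict[str, List[str]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip() or row[0].startswith("#"):
            continue
        values = [cell.strip() for cell in row[1:]]
        # Drop trailing empty cells only, so that e.g. first and last
        # names stay aligned by column position.
        while values and not values[-1]:
            values.pop()
        fields[row[0].strip()] = values
    return fields


# Made-up IDF content for illustration only.
EXAMPLE_IDF = (
    "Investigation Title\tExample islet expression study\n"
    "Person Last Name\tDoe\tRoe\n"
    "Person First Name\tJane\tRichard\n"
    "Protocol Name\tP-EXTRACT\tP-HYB\n"
    "SDRF File\texample.sdrf.txt\n"
)

idf = read_idf(io.StringIO(EXAMPLE_IDF))
print(idf["Investigation Title"])
print(list(zip(idf["Person First Name"], idf["Person Last Name"])))
```

Keeping values positional matters because related rows (such as the person name rows) are matched up by column.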
Annotare Features
• Intuitive graphical user interface forms for editing
• Ontology support: a built-in ontology and web-service connectivity to BioPortal
• Searchable standard templates
• Design wizard
• Validation module
• Mac and Windows support
http://code.google.com/p/annotare/

Annotare Demo
• File Gallery: three different ways to get started
• Looking at an existing MAGE-TAB
  – Form versus sheet view
• Using a template
• Using the wizard
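An existing MAGE-TAB can also be inspected outside Annotare. The sketch below, in Python, lists the data files that an SDRF references for each sample. It assumes a tab-delimited SDRF with a "Source Name" column and one or more columns whose names end in "Data File"; the function name `data_files_by_source` is hypothetical, and the example rows reuse sample and file names from the ChIP-Seq study slide in an assumed column layout, not the study's actual SDRF.

```python
import csv
import io
from typing import Dict, List, TextIO


def data_files_by_source(handle: TextIO) -> Dict[str, List[str]]:
    """Collect, per Source Name, every value found in a '... Data File' column.

    csv.reader is used instead of csv.DictReader because SDRF headers may
    repeat (for example several 'Protocol REF' columns).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.strip() for name in next(reader)]
    source_col = header.index("Source Name")
    file_cols = [i for i, name in enumerate(header) if name.endswith("Data File")]
    files: Dict[str, List[str]] = {}
    for row in reader:
        if len(row) <= source_col or not row[source_col].strip():
            continue  # skip blank or incomplete rows
        found = [row[i].strip() for i in file_cols if i < len(row) and row[i].strip()]
        files.setdefault(row[source_col].strip(), []).extend(found)
    return files


# Assumed layout for illustration; the real SDRF columns may differ.
EXAMPLE_SDRF = (
    "Source Name\tProtocol REF\tAssay Name\tArray Data File\tDerived Array Data File\n"
    "Ptf1a_s5\tsequencing\tPtf1a_s5\tPtf1a_s5_seq.txt\ts5_eland.txt\n"
    "Input_s8\tsequencing\tInput_s8\tInput_s8_seq.txt\ts8_eland.txt\n"
)

for source, names in data_files_by_source(io.StringIO(EXAMPLE_SDRF)).items():
    print(source, names)
```

A listing like this can be compared against the files actually uploaded, since missing files delay data loading.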