Submission_tutorial_MAGE

Data submission to AE
ArrayExpress
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
MAGE-TAB Example
ArrayExpress & Atlas
MAGE-TAB Submission
Ontology Link
Submitter is directed to either submit an experiment or download a template to the desktop.
MAGE-TAB Submission
Submission of HTS gene expression data
• Submit via the MAGE-TAB submission route.
• Submit:
  • A MAGE-TAB spreadsheet containing details of the samples and protocols used
  • Trace data files for each sample (in SRF, FASTQ or SFF format)
  • Processed data files
• For non-human species, we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data, you need to submit to the European Genome-phenome Archive (EGA) and not to ArrayExpress. The EGA will supply you with a suitable template for submission and will store human identifiable data securely.
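As a rough illustration of the rules above, the sketch below checks whether a trace file is in one of the accepted formats and picks the archive to submit to. The file extensions, the tolerance for .gz compression, and the function names trace_format and submission_route are assumptions made for this example; the slides name only the formats and the two archives.

```python
from pathlib import Path

# Assumed extension-to-format mapping; the slides list formats, not extensions.
TRACE_FORMATS = {".srf": "SRF", ".fastq": "FASTQ", ".fq": "FASTQ", ".sff": "SFF"}


def trace_format(filename):
    """Return "SRF", "FASTQ" or "SFF" for a trace file, or None if unrecognized."""
    suffixes = [s.lower() for s in Path(filename).suffixes]
    # Tolerate gzip-compressed files (an assumption, not stated in the slides).
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return TRACE_FORMATS.get(suffixes[-1]) if suffixes else None


def submission_route(human_identifiable):
    """Pick the archive: human identifiable data go to the EGA, the rest to ArrayExpress."""
    if human_identifiable:
        return "European Genome-phenome Archive"
    return "ArrayExpress (MAGE-TAB submission)"


if __name__ == "__main__":
    for name in ["run1.fastq", "run2.fastq.gz", "run3.sff", "notes.txt"]:
        print(name, "->", trace_format(name))
    print(submission_route(human_identifiable=False))
```

A check like this could be run over a data directory before filling in the Array Data File column, so that unsupported files are caught early.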
Types of data that can be submitted
MAGE-TAB Example: IDF
Protocol Hardware: value indicating which sequencing instrument was used (e.g. 454 GS, Illumina Genome Analyzer, AB SOLiD System).
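To make the IDF callout concrete, here is a minimal sketch that writes the protocol rows of an IDF as tab-delimited text. The protocol names are taken from the SDRF example in this tutorial; pairing MY SEQ PROTOCOL with a 454 GS instrument is an assumption for illustration, as are the helper names idf_protocol_rows and to_tsv. A real IDF contains many more fields than the two shown.

```python
import csv
import io

# One entry per protocol. The hardware value for the sequencing protocol
# is an assumed example; the extraction protocol has no instrument here.
PROTOCOLS = [
    {"name": "EXTRACTION", "hardware": ""},
    {"name": "MY SEQ PROTOCOL", "hardware": "454 GS"},
]


def idf_protocol_rows(protocols):
    """Build IDF rows: field name first, then one value per protocol."""
    return [
        ["Protocol Name"] + [p["name"] for p in protocols],
        ["Protocol Hardware"] + [p["hardware"] for p in protocols],
    ]


def to_tsv(rows):
    """Serialize rows as tab-delimited text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_tsv(idf_protocol_rows(PROTOCOLS)), end="")
```

Each IDF row lines up by position, so the Protocol Hardware value sits in the same column as the name of the protocol it describes.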
MAGE-TAB Example: SDRF

Source Name  Material Type   Term Source REF  Characteristics[Organism]  Characteristics[Sex]  Characteristics[StrainOrLine]  Protocol REF
finch 1      whole_organism  MGED Ontology    Geospiza fortis            male                  Pinta                          EXTRACTION
finch 2      whole_organism  MGED Ontology    Geospiza fortis            male                  Pinta                          EXTRACTION
finch 3      whole_organism  MGED Ontology    Geospiza fortis            male                  Marchesa                       EXTRACTION
finch 4      whole_organism  MGED Ontology    Geospiza fortis            male                  Marchesa                       EXTRACTION
finch 5      whole_organism  MGED Ontology    Geospiza fortis            male                  Santiago                       EXTRACTION
finch 6      whole_organism  MGED Ontology    Geospiza fortis            male                  Santiago                       EXTRACTION
finch 7      whole_organism  MGED Ontology    Geospiza fortis            male                  Floreana                       EXTRACTION
finch 8      whole_organism  MGED Ontology    Geospiza fortis            male                  Floreana                       EXTRACTION
finch 9      whole_organism  MGED Ontology    Geospiza fortis            male                  Pinzon                         EXTRACTION
finch 10     whole_organism  MGED Ontology    Geospiza fortis            male                  Pinzon                         EXTRACTION
MAGE-TAB Example: SDRF

Protocol REF     Performer  Assay Name  Technology Type             COMMENT[FLOW_SEQUENCE]  COMMENT[FLOW_COUNT]  Array Data File
MY SEQ PROTOCOL  MWG        pinta 1     high_throughput_sequencing  TACG                    800                  run1.fastq
MY SEQ PROTOCOL  MWG        pinta 2     high_throughput_sequencing  TACG                    800                  run2.fastq
MY SEQ PROTOCOL  MWG        marchesa 1  high_throughput_sequencing  TACG                    800                  run3.fastq
MY SEQ PROTOCOL  MWG        marchesa 2  high_throughput_sequencing  TACG                    800                  run4.fastq
MY SEQ PROTOCOL  MWG        santiago 1  high_throughput_sequencing  TACG                    800                  run5.fastq
MY SEQ PROTOCOL  MWG        santiago 2  high_throughput_sequencing  TACG                    800                  run6.fastq
MY SEQ PROTOCOL  MWG        floreana 1  high_throughput_sequencing  TACG                    800                  run7.fastq
MY SEQ PROTOCOL  MWG        floreana 2  high_throughput_sequencing  TACG                    800                  run8.fastq
MY SEQ PROTOCOL  MWG        pinzon 1    high_throughput_sequencing  TACG                    800                  run9.fastq
MY SEQ PROTOCOL  MWG        pinzon 2    high_throughput_sequencing  TACG                    800                  run10.fastq
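An SDRF is a tab-delimited file, so the two example fragments above can be generated rather than typed by hand. The sketch below rebuilds the ten example rows with Python's csv module. It assumes the two fragments are the left and right halves of one sheet, and it covers only the columns shown in the example (so no Extract Name or library Comment[] columns); the names HEADER, ISLANDS, build_rows and sdrf_text are made up for this illustration.

```python
import csv
import io

# Column order follows the example. "Protocol REF" appears twice (extraction,
# then sequencing), so rows are plain lists rather than dicts keyed by header.
HEADER = [
    "Source Name", "Material Type", "Term Source REF",
    "Characteristics[Organism]", "Characteristics[Sex]",
    "Characteristics[StrainOrLine]", "Protocol REF",
    "Protocol REF", "Performer", "Assay Name", "Technology Type",
    "COMMENT[FLOW_SEQUENCE]", "COMMENT[FLOW_COUNT]", "Array Data File",
]

ISLANDS = ["Pinta", "Marchesa", "Santiago", "Floreana", "Pinzon"]


def build_rows():
    """Return one row per sample: two finches per island, ten in total."""
    rows = []
    sample = 0
    for island in ISLANDS:
        for replicate in (1, 2):
            sample += 1
            rows.append([
                f"finch {sample}", "whole_organism", "MGED Ontology",
                "Geospiza fortis", "male", island, "EXTRACTION",
                "MY SEQ PROTOCOL", "MWG", f"{island.lower()} {replicate}",
                "high_throughput_sequencing", "TACG", "800",
                f"run{sample}.fastq",
            ])
    return rows


def sdrf_text():
    """Serialize the header and rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(build_rows())
    return buf.getvalue()


if __name__ == "__main__":
    print(sdrf_text(), end="")
```

Writing rows as lists keeps both Protocol REF columns intact; a dict keyed by header name would silently drop one of them.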
What needs to be included in the Spreadsheet?
• Include Assay Name and Technology Type columns.
• Raw files must go in the Array Data File column.
• A sequencing protocol must be provided.
  • The sequencing protocol should have a performer; this is used as the run center name.
  • This protocol must have a Protocol Hardware value stating which sequencing instrument was used (e.g. 454 GS, Illumina Genome Analyzer, AB SOLiD System).
  • Reference this protocol in the Protocol REF column before the Assay Name column.
• These 4 extra Comment[] columns should be added after Extract Name to provide information about how the library was prepared:
  • Comment[LIBRARY_LAYOUT]: either SINGLE or PAIRED
  • Comment[LIBRARY_SOURCE]: one of GENOMIC, NON GENOMIC, SYNTHETIC, VIRAL RNA, OTHER
  • Comment[LIBRARY_STRATEGY]: one of WGS, WCS, CLONE, POOLCLONE, AMPLICON, BARCODE, CLONEEND, FINISHING, ChIP-Seq, MNase-Seq, EST, FL-cDNA, CTS, OTHER
  • Comment[LIBRARY_SELECTION]: one of RANDOM, PCR, RANDOM PCR, RT-PCR, HMPR, MF, CF-S, CF-M, CF-H, CF-T, MSLL, cDNA, ChIP, MNase, other, unspecified
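The checklist above can be turned into a small pre-submission check. The sketch below is not the validation that ArrayExpress itself runs; it is a minimal illustration that assumes column names are matched case-insensitively and that values must match the listed terms exactly. The function name check_sdrf and the wording of the messages are invented for this example.

```python
# Allowed values, copied from the list above.
LIBRARY_VOCAB = {
    "LIBRARY_LAYOUT": {"SINGLE", "PAIRED"},
    "LIBRARY_SOURCE": {"GENOMIC", "NON GENOMIC", "SYNTHETIC", "VIRAL RNA", "OTHER"},
    "LIBRARY_STRATEGY": {
        "WGS", "WCS", "CLONE", "POOLCLONE", "AMPLICON", "BARCODE", "CLONEEND",
        "FINISHING", "ChIP-Seq", "MNase-Seq", "EST", "FL-cDNA", "CTS", "OTHER",
    },
    "LIBRARY_SELECTION": {
        "RANDOM", "PCR", "RANDOM PCR", "RT-PCR", "HMPR", "MF", "CF-S", "CF-M",
        "CF-H", "CF-T", "MSLL", "cDNA", "ChIP", "MNase", "other", "unspecified",
    },
}

REQUIRED = ["assay name", "technology type", "array data file", "protocol ref"]


def check_sdrf(header, rows):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    names = [h.strip().casefold() for h in header]

    for col in REQUIRED:
        if col not in names:
            problems.append(f"missing column: {col}")

    # The sequencing protocol is referenced in a Protocol REF before Assay Name.
    if "assay name" in names and "protocol ref" in names:
        if "protocol ref" not in names[:names.index("assay name")]:
            problems.append("no Protocol REF column before Assay Name")

    for key, allowed in LIBRARY_VOCAB.items():
        col = f"comment[{key}]".casefold()
        if col not in names:
            problems.append(f"missing column: Comment[{key}]")
            continue
        idx = names.index(col)
        if "extract name" in names and idx < names.index("extract name"):
            problems.append(f"Comment[{key}] should come after Extract Name")
        for n, row in enumerate(rows, start=1):
            if row[idx] not in allowed:
                problems.append(f"row {n}: {row[idx]!r} is not a valid {key}")
    return problems
```

Calling check_sdrf with the header row and the data rows of a sheet returns every problem at once, which is more convenient than stopping at the first error when a spreadsheet has several columns to fix.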
Platform Specific Attributes
Include the following attributes as Comment[] columns after Assay Name:
• For LS454:
  • KEY_SEQUENCE (string: the first bases that are expected to be produced by the challenge bases)
  • FLOW_SEQUENCE (string, e.g. TACG)
  • FLOW_COUNT (integer)
• For Illumina:
  • SEQUENCE_LENGTH (integer: the fixed number of bases expected in each raw sequence, including both mate pairs and any technical reads)
• For Helicos:
  • FLOW_SEQUENCE
  • FLOW_COUNT
• For ABI SOLID:
  • SEQUENCE_LENGTH (integer: the fixed number of bases expected in each raw sequence, including both mate pairs and any technical reads)
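The platform list above maps naturally onto a lookup table. The following sketch, under the assumption that FLOW_COUNT and SEQUENCE_LENGTH are integers on every platform that uses them, returns the Comment[] columns a platform needs and checks one row's values. The names PLATFORM_ATTRIBUTES, platform_columns and check_platform_values are hypothetical.

```python
# Attributes per platform, as listed above.
PLATFORM_ATTRIBUTES = {
    "LS454": ["KEY_SEQUENCE", "FLOW_SEQUENCE", "FLOW_COUNT"],
    "Illumina": ["SEQUENCE_LENGTH"],
    "Helicos": ["FLOW_SEQUENCE", "FLOW_COUNT"],
    "ABI SOLID": ["SEQUENCE_LENGTH"],
}

# Attributes whose values are integers (assumed to hold for every platform).
INTEGER_ATTRIBUTES = {"FLOW_COUNT", "SEQUENCE_LENGTH"}


def platform_columns(platform):
    """Return the Comment[] column names to add after Assay Name."""
    return [f"Comment[{attr}]" for attr in PLATFORM_ATTRIBUTES[platform]]


def check_platform_values(platform, values):
    """Check one row's attribute values (a dict of strings); return problems found."""
    problems = []
    for attr in PLATFORM_ATTRIBUTES[platform]:
        value = values.get(attr, "")
        if not value:
            problems.append(f"missing {attr}")
        elif attr in INTEGER_ATTRIBUTES and not value.isdigit():
            problems.append(f"{attr} must be an integer, got {value!r}")
    return problems


if __name__ == "__main__":
    print(platform_columns("LS454"))
    # Values from the SDRF example, which shows no KEY_SEQUENCE column.
    print(check_platform_values("LS454", {"FLOW_SEQUENCE": "TACG", "FLOW_COUNT": "800"}))
```

Run against the values in the SDRF example, the check flags the absent KEY_SEQUENCE, which is a reminder that the example sheet shows only some of the LS454 attributes.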
MAGE-TAB Submission
Indicate submission type
Ontology Link
Submitter is directed to either submit an experiment or download a template to the desktop.
MAGE-TAB Submission
What happens after submission?
• Submit: MAGE-TAB spreadsheet, raw and processed data files
• Curation
• MTAB to SRA conversion script: SRA XML and raw data go to ENA
• MAGE-TAB and processed data go to the ArrayExpress Archive
• ENA and the ArrayExpress Archive are linked by accessions
What happens after submission?
• Email confirmation
• Curation
• The curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.