Formats and standards for sequencing data
Matúš Kalaš
INF389, CBU, BCCS/UiB, Bergen
Nov 12, 2010

[Overview diagram]
Sequencing platforms: 454, Solexa/Illumina, SOLiD, …
Read repositories: NCBI SRA, EMBL-EBI ENA, Your databases
Read alignment tools: SHRiMP, Maq, BWA, Bowtie, RMAP, Eland, SOAP, SOAP2, MOSAIK, SOCS, PatMaN, ZOOM, PASS, PerM, RazerS, segemehl, MPSCAN, BFAST, Lastz, BLAT
Assemblers: Celera, Newbler, Velvet, Euler, SOAPdenovo, …
Results: Genome, Metagenome, Gene annotation, Gene expression, Binding sites, Variation, …
Result databases: GenBank, EMBL, DDBJ, Genome Catalogue, SNPdb, …

454 output formats
• .sff
• .fna
• .qual

Illumina output formats
• .seq.txt
• .prb.txt
• Illumina FASTQ (ASCII – 64 is Illumina score)
• Qseq (ASCII – 64 is Phred score)
• Illumina single line format SCARF

SOLiD output format(s)
• CSFASTA

Real (“standard”) FASTQ = Sanger FASTQ (ASCII – 33 is Phred score)

Example of dealing with diverse read formats: in Galaxy (http://usegalaxy.org)

If reads should be deposited in a public repository:
• SRA (Short Read Archive) at NCBI
• ENA at EMBL-EBI
• SRA format (XML)
• SRF format
Or should they be deleted?

Common (“standard”) format for read alignments:
• SAM
• BAM (= binary SAM)

Some common formats for results (genome/gene annotation):
• BED format (genome-browser tracks)
• GFF format (gene/genome features)
• BioXSD (any annotation; XML; under development)

Deposit genome/metagenome in a public repository:
• INSDC databases: GenBank, EMBL, DDBJ

Deposit genome/metagenome metadata:
• MIGS/MIMS standard by GSC
• GCDML format (XML; under development), following the MIGS/MIMS standard

MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample

MIGS/MIMS checklist:

MIGS/MIMS metadata example:

Sequencing experiment metadata:
• MINSEQE standard by FGED: Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (under development)

Take-home messages:
• Use raw sequencing data when possible
• For base-call data, use “standard” FASTQ (Sanger, Phred)
• For read alignments, use SAM/BAM format
• Use common formats for your results (e.g. GFF or BED format)
• Hope for new, generic, extensible standard format(s)
• Submit MIGS/MIMS-compliant metadata of genome sequences
• Keep an eye on MINSEQE standard, store your sequencing metadata
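
Appendix sketch: re-encoding quality strings to “standard” FASTQ

The slides recommend Sanger FASTQ (ASCII – 33 is Phred score) for base-call data, while Qseq and Illumina FASTQ use an offset of 64. The Python sketch below illustrates what such a re-encoding might look like. It is not taken from the slides: the function names are hypothetical, and it assumes that the "Illumina score" mentioned for Illumina FASTQ is the older Solexa log-odds score, which needs a logarithmic conversion rather than a plain offset shift.

```python
import math

SANGER_OFFSET = 33    # "standard" FASTQ: ASCII code - 33 is the Phred score
ILLUMINA_OFFSET = 64  # Qseq / Illumina FASTQ: ASCII code - 64 is the score


def phred64_to_sanger(quality: str) -> str:
    """Re-encode ASCII-64 Phred qualities (e.g. from Qseq) as ASCII-33."""
    shifted = []
    for ch in quality:
        score = ord(ch) - ILLUMINA_OFFSET
        if score < 0:
            raise ValueError(f"{ch!r} is below the ASCII-64 range")
        shifted.append(chr(score + SANGER_OFFSET))
    return "".join(shifted)


def solexa_to_phred(score: int) -> int:
    """Map a Solexa log-odds score to the nearest Phred score.

    Assumption: Solexa score = -10*log10(p / (1 - p)) and
    Phred score = -10*log10(p), where p is the error probability.
    """
    return round(10 * math.log10(10 ** (score / 10) + 1))


def solexa64_to_sanger(quality: str) -> str:
    """Re-encode ASCII-64 Solexa qualities as ASCII-33 Phred qualities."""
    return "".join(
        chr(solexa_to_phred(ord(ch) - ILLUMINA_OFFSET) + SANGER_OFFSET)
        for ch in quality
    )


if __name__ == "__main__":
    print(phred64_to_sanger("@Bh"))   # input encodes Phred scores 0, 2, 40
    print(solexa64_to_sanger("@h"))   # input encodes Solexa scores 0 and 40
```

Only the quality line of each FASTQ record needs re-encoding; the identifier and sequence lines stay unchanged. For real data an established converter (for example the FASTQ tools in Galaxy) is likely a safer choice than hand-written code.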