slides (see also notes)

Formats and standards
for sequencing data
Matúš Kalaš
INF389, CBU, BCCS/UiB, Bergen
Nov 12, 2010
454
Solexa/Illumina
SOLiD
…
NCBI SRA
EMBL-EBI ENA
Your databases
SHRiMP
Maq
BWA
Bowtie
RMAP
Eland
SOAP
SOAP2
MOSAIK
SOCS
PatMaN
ZOOM
PASS
PerM
RazerS
segemehl
MPSCAN
BFAST
Lastz
BLAT
Genome
Metagenome
Gene annotation
Gene expression
Binding sites
Variation
…
Celera
Newbler
Velvet
Euler
SOAPdenovo
…
GenBank
EMBL
DDBJ
Genome Catalogue
dbSNP
…
454 output formats
.sff
.fna
.qual
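The .fna file holds the bases (FASTA) and the .qual file holds the matching Phred scores as space-separated integers under the same read header. A minimal sketch of merging the two into Sanger FASTQ follows; the function names `parse_records` and `fna_qual_to_fastq` are made up for illustration, and real files are better handled by an established library.

```python
def parse_records(text):
    """Parse FASTA-like text into {read_id: body lines joined by a space}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read id is the first word after '>'
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: " ".join(parts) for name, parts in records.items()}


def fna_qual_to_fastq(fna_text, qual_text):
    """Combine .fna bases and .qual Phred scores into Sanger FASTQ text."""
    seqs = parse_records(fna_text)
    quals = parse_records(qual_text)
    out = []
    for name, seq in seqs.items():
        seq = seq.replace(" ", "")
        scores = [int(q) for q in quals[name].split()]
        if len(scores) != len(seq):
            raise ValueError(f"length mismatch for read {name}")
        # Sanger FASTQ stores each Phred score as the character chr(score + 33)
        qual_string = "".join(chr(q + 33) for q in scores)
        out.append(f"@{name}\n{seq}\n+\n{qual_string}\n")
    return "".join(out)


fna = ">read1 length=4\nACGT\n"
qual = ">read1 length=4\n40 30 20 10\n"
fastq = fna_qual_to_fastq(fna, qual)
```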
Illumina output formats
.seq.txt
.prb.txt
Illumina FASTQ
(ASCII code minus 64 is the Illumina score)
Qseq
(ASCII code minus 64 is the Phred score)
Illumina single line format
SCARF
SOLiD output format(s)
CSFASTA
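A CSFASTA read starts with a known primer base followed by colour digits 0 to 3, each encoding the transition between two adjacent bases. The sketch below decodes such a read into bases, using the fact that with A=0, C=1, G=2, T=3 the colour equals the XOR of the two base codes. The function name `decode_colorspace` is made up for illustration. Naive decoding like this propagates a single colour error to all following bases, so it is meant only to show the encoding.

```python
BASES = "ACGT"  # index in this string is the 2-bit code of the base


def decode_colorspace(cs_read):
    """Decode a colour-space read such as 'T0120' into base space.

    The first character is the primer base; it anchors the decoding
    but is not part of the returned read.
    """
    base = BASES.index(cs_read[0])
    decoded = []
    for color in cs_read[1:]:
        if color not in "0123":
            # e.g. '.' marks a missing colour call; not handled in this sketch
            raise ValueError(f"cannot decode colour {color!r}")
        # colour = previous base XOR next base, so next = previous XOR colour
        base ^= int(color)
        decoded.append(BASES[base])
    return "".join(decoded)


read = decode_colorspace("T0120")
```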
Real (“standard”) FASTQ = Sanger FASTQ
(ASCII code minus 33 is the Phred score)
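The offsets above can be illustrated with a small conversion from an offset-64 quality string to the Sanger offset-33 encoding. This is a sketch that assumes the input scores are already Phred-scaled; older Solexa-scaled scores use a different scale and would need an additional transformation that is not shown here. The function names are made up for illustration.

```python
def phred_scores(qual, offset):
    """Decode a FASTQ quality string into integer scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def illumina_to_sanger(qual):
    """Re-encode an offset-64 quality string as Sanger offset-33."""
    converted = []
    for score in phred_scores(qual, 64):
        if score < 0:
            # A character below '@' cannot come from an offset-64 encoding
            raise ValueError("quality string does not look like offset 64")
        converted.append(chr(score + 33))
    return "".join(converted)


sanger = illumina_to_sanger("hhhB")
```

Here `h` (ASCII 104) is score 40 and becomes `I` (ASCII 73), while `B` (ASCII 66) is score 2 and becomes `#` (ASCII 35).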
Example of dealing with diverse read formats in Galaxy (http://usegalaxy.org):
If reads should be deposited in a public repository:
SRA (Sequence Read Archive, formerly Short Read Archive) at NCBI
ENA at EMBL-EBI
SRA format (XML)
SRF format
Or should they be deleted?
Common (“standard”) format for read alignments:
SAM
BAM
(= binary SAM)
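A SAM alignment line is tab-separated text with 11 mandatory fields followed by optional tags. The sketch below parses one such line; the class and function names are made up for illustration, and the field names follow the current SAM specification (older versions of the format used different names for the mate-related columns). Real data, and BAM in particular, is better read with an established library.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # template (insert) length
    seq: str     # read bases
    qual: str    # base qualities, Phred + 33
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=tuple(fields[11:]),
    )


def is_unmapped(record):
    """Flag bit 0x4 marks an unmapped read."""
    return bool(record.flag & 0x4)


def is_reverse(record):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


rec = parse_sam_line("read1\t16\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
```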
Some common formats for results:
(Genome/Gene annotation)
BED format
(genome-browser tracks)
GFF format
(gene/genome features)
BioXSD
(any annotation; under development)
(XML)
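BED and GFF describe the same kind of interval with different coordinate conventions: BED starts are 0-based with an exclusive end, while GFF coordinates are 1-based and inclusive. A minimal sketch of converting a BED line to a GFF-style line follows. The function name, the default `source` and `feature_type` values, and the GFF3-style `Name=` attribute are assumptions chosen for illustration.

```python
def bed_to_gff(bed_line, source="bed2gff", feature_type="region"):
    """Convert one BED line (3 to 6 columns) into a 9-column GFF-style line."""
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start and end")
    chrom = fields[0]
    start = int(fields[1])
    end = int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 else "."
    attributes = f"Name={name}" if name else "."
    return "\t".join([
        chrom, source, feature_type,
        str(start + 1),  # 0-based start becomes 1-based
        str(end),        # exclusive 0-based end equals inclusive 1-based end
        score, strand,
        ".",             # phase/frame is unknown for a generic interval
        attributes,
    ])


gff_line = bed_to_gff("chr1\t99\t200\tgeneA\t0\t+")
```

The interval length is preserved: BED `99..200` covers 101 bases, as does GFF `100..200`.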
Deposit genome/metagenome in a public repository:
INSDC databases: GenBank, EMBL, DDBJ
Deposit genome/metagenome metadata:
MIGS/MIMS standard by GSC
GCDML format
(XML)
(under development)
following the MIGS/MIMS standard
MIGS: Minimum Information about a Genome Sequence
MIMS: Minimum Information about a Metagenome Sequence/Sample
MIGS/MIMS checklist:
MIGS/MIMS metadata example:
Sequencing experiment metadata:
MINSEQE standard by FGED
Minimum Information about a high-throughput Nucleotide SEQuencing Experiment
(under development)
Take-home messages:
• Use raw sequencing data when possible
• For base-call data, use “standard” FASTQ (Sanger, Phred)
• For read alignments, use SAM/BAM format
• Use common formats for your results (e.g. GFF or BED format)
• Hope for new, generic, extensible standard format(s)
• Submit MIGS/MIMS-compliant metadata of genome sequences
• Keep an eye on the MINSEQE standard; store your sequencing metadata