CRAM: reference-based compression format
Developed by Vadim Zalunin
EBI is an Outstation of the European Molecular Biology Laboratory.

Data horror
EMBL-EBI: 10 petabytes
SRA: ~1 petabyte
Over 2 million DVDs, or 2.5 km
Complete Genomics: 0.5 TB for a single file

The need for compression
Red alert

Compression, what is it?
Lossless: BMP, 190 KB; PNG, 100 KB
Lossy: JPG, 21 KB; JPG, 4 KB

Compression, when we know what to expect
Lossless: BMP, 145 KB; PNG, 2 KB
Lossy: JPG, 6 KB; JPG, 3 KB
But the actual message is only 40 characters (bytes) long!

Compression at its best
IMAGE, 145 KB -> compress -> TEXT, 40 bytes -> uncompress -> IMAGE, 145 KB
The text: "Five little ducks went swimming one day"
~3500 times more efficient

What are we talking about
bug sample -> sequencing machines -> bunch of huge files
The bug’s DNA is hidden somewhere.

Looking closer at the data
The bunch of huge files boils down to a long list of reads:
read 1
read 2
read 3
…
read bazillion
Each read represents a short nucleotide sequence from the genome. Additional information may be attached to it, for example error estimates.

What is a Read?
    @SRR081241.20758946                            (read name)
    CCAGATCCTGGCCCTAAACAGGTGGTAAGGAAGGAGAGAGTG…    (read bases)
    +
    IDCEFFGGHHGGGHIGIHGFEFCFFDDGFFGIIHHIGIHHFI…    (read quality scores)
An excerpt from a FASTQ file.
Bases: ACGTN
Quality scores: from ‘!’ (ASCII 33) to ‘~’ (ASCII 126)

What is a quality score?
The quality score is a Phred quality score encoded as an ASCII symbol in the range 33-126.
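To make the encoding concrete, here is a minimal Python sketch of decoding a FASTQ quality string. It assumes the offset of 33 implied by the ASCII range on the slide and the standard Phred definition Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. The function names are illustrative and not part of any CRAM tool.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    The offset of 33 is an assumption taken from the ASCII range
    quoted on the slide ('!' is ASCII 33).
    """
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score):
    """Invert the Phred definition Q = -10 * log10(P) to recover P."""
    return 10 ** (-score / 10)


# First ten symbols of the quality line from the FASTQ excerpt.
quality = "IDCEFFGGHH"
scores = phred_scores(quality)
print(scores)  # [40, 35, 34, 36, 37, 37, 38, 38, 39, 39]

for symbol in "!I":
    q = phred_scores(symbol)[0]
    print(symbol, q, error_probability(q))
```

With this offset, ‘!’ decodes to score 0 (the call is as good as a guess) and ‘I’ decodes to score 40 (roughly one error in 10,000 calls).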
Basically: higher scores are better, so ‘!’ is bad and ‘I’ is good.

Reference based encoding
Reference sequence: T G A G C T C T A A G T A C C C G C G G T C T G T C C G
[Figure: reads 1 to 5 aligned under the reference sequence, each with a read start position and a read end position.]
[Figure: the same alignment with every base that matches the reference drawn as a dot. Only the mismatching bases remain: a T in read 1 and an A in read 4.]

Lossy quality scores
Approach 1: Quality scores are usually values from 0 to 39. Let’s shrink them so that they range from 0 to 7.
Approach 2: Let’s treat quality scores using alignment information. For example: preserve quality scores only for mismatching bases.

Comparison study: 1K Genomes exomes
BAM -> compress -> CRAM -> uncompress -> BAM
Original BAM -> some analysis pipeline -> original SNPs
Uncompressed BAM -> some analysis pipeline -> restored SNPs

CRAM NGS data compression
[Chart: bits/base, on an axis labelled from (bad) to (good), for untreated data, CRAM lossless, CRAM lossy and CRAM very lossy, grouped as Do nothing, Lossless and Lossy.]

Progressive application of compression
[Chart: sample accessibility (hard to easy) against sample value (low to high), with regions labelled 2-fold, 20-fold, 200-fold and Lossless.]

References
More information: http://www.ebi.ac.uk/ena/about/cram_toolkit
Mailing list: http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev
Publications:
Fritz, M.H., Leinonen, R., et al. (2011) Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21 (5), 734-40
Cochrane G., Cook C.E. and Birney E. (2012) The future of DNA sequence archiving. Gigascience 1
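
Backup: the reference based encoding and lossy quality score slides can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the CRAM format itself: it handles substitutions only (no insertions, deletions or clipping), the reference and read are made-up sequences loosely modeled on the slide, and the binning rule in `bin_quality` is just one possible way to shrink 0-39 to 0-7. All function names are hypothetical.

```python
# Toy reference, assumed for illustration only.
REFERENCE = "TGAGCTCTAAGTAGCCGCGGTCTGTCCG"


def encode_read(reference, start, bases):
    """Store a read as (start, length, mismatches) instead of its bases.

    mismatches is a list of (offset_in_read, base) pairs; bases that
    match the reference are not stored at all.
    """
    window = reference[start:start + len(bases)]
    if len(window) != len(bases):
        raise ValueError("read extends past the end of the reference")
    mismatches = [(i, b) for i, (r, b) in enumerate(zip(window, bases)) if r != b]
    return start, len(bases), mismatches


def decode_read(reference, start, length, mismatches):
    """Rebuild the read: copy the reference window, then patch mismatches."""
    bases = list(reference[start:start + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


def bin_quality(score, levels=8, max_score=39):
    """Approach 1 (assumed rule): map scores 0..39 onto bins 0..7."""
    clamped = min(max(score, 0), max_score)
    return clamped * levels // (max_score + 1)


def keep_mismatch_qualities(scores, mismatches):
    """Approach 2: keep quality scores only at mismatching positions."""
    return {offset: scores[offset] for offset, _ in mismatches}


read = "TGAGCTCTTAGTAGC"  # differs from the reference at offset 8
encoded = encode_read(REFERENCE, 0, read)
print(encoded)  # (0, 15, [(8, 'T')])
print(decode_read(REFERENCE, *encoded) == read)  # True

scores = [39] * 15
print([bin_quality(s) for s in (0, 5, 20, 39)])  # [0, 1, 4, 7]
print(keep_mismatch_qualities(scores, encoded[2]))  # {8: 39}
```

The 15-base read shrinks to a start position, a length and a single mismatch, which is the effect the dotted alignment figure illustrates. The base encoding is lossless because `decode_read` restores the read exactly; both quality score approaches discard information.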