Mouse Genetics
January 24, 2013, 14:00–16:30

(A tool kit for) Deep Seq Data Analysis
Christophe.antoniewski@upmc.fr
http://drosophile.org

Why deep seq analyses?
Your project involves:
• Mutation and SNP identification or analysis (genome re-sequencing)
• Gene/disease linkage (genome re-sequencing)
• Pathogen identification (de novo sequence assembly or re-sequencing)
• Transcriptome analysis (RNAseq)
• DNA methylation study (MeDIP-seq)
• Chromatin study (ChIPseq)
• Transcription factor study (ChIPseq)
• miRNAs, siRNAs, piRNAs, tRFs, etc. (small RNA seq)
• Single cell transcriptome analysis
Deep seq provides both qualitative information and quantitative information.

Sequencing Technologies

Sequencing Technologies: Quantitative Facts

Sequencing Technologies: Focus on Illumina technology
« Library », « Clusters », Cluster Sequences

For mRNA seq: non-directional

Small RNAseq library preparation (directional)
• Biases
• 20-30 nt RNA gel purification
• Library “bar coding”

ChIPseq library preparation (non-directional)

What can I do with my sequence reads?
• Locus discovery / mutation discovery / splicing annotation
• Annotation & visualization
• Read quantitative profiling (transcriptome, chromatin profiling, etc.)
• Statistics
• Structure analysis of precursors, signatures…
• Maths & Statistics…

Flowchart of a sequencing project
• What am I going to sequence? Whole genome, whole exome, target enrichment
• Platform selection: inherent biases; specific benefits (read length, single or paired ends, number of reads)
• Library preparation: size selection, amplification, single cell protocol
• Sequencing: number of cycles, number of lanes
• Quality control: adapter clipping, quality trimming, contaminant and error identification
• Alignment: Bowtie, BWA… (P Flicek & E Birney, Nature Methods 2009)
• Assembly: Velvet, SSAKE… (Zhang W, Chen J, et al. (2011) PLoS ONE 6(3))
• Visualization & Statistics (R & open source software tools):
  • Normalization (library comparison)
  • Peak finding (binding sites, breakpoints, etc.)
  • Differential calling (expression, variants, etc.)
Think about the number of replicates from the start.

A case study: miRNAs (and other small RNAs)
(Figure labels: Hen1 met; + snoRNA, tRNA, rRNA fragments)

Small RNAseq library preparation (directional)
• Biases
• 20-30 nt RNA gel purification
• Library “bar coding”

Basic Material
• A sequence file (fastq format)
• A computer with enough RAM (8 gigabytes is a good start)
• A Unix-compliant operating system + a bit of « basic know-how »
• A couple of very useful software tools with a Graphical User Interface (GUI):
  • TextWrangler, an advanced text editor with RegEx integration
  • R (for statistics and, more importantly, graphics)
  • … GALAXY is an (our) option
• Knowledge of at least one programming language: Python, Perl, Java, C++…

What does this big* fastq file contain?
* Size limit for opening a text file with a text editor: ~1.2 GB
Unix Terminal:
    more <path/to/the/file>

    $ more GKG-13.fastq
    @HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
    TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
    +HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
    bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh
    @HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
    TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
    +HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
    ]B]VWaaaaaagggfggggggcggggegdgfgeggbab
    @HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
    TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
    +HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
    aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
    @HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
    TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
    +HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
    aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
    @HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
    TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
    +HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
    aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
    @HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
    GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
    +HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
    aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
    …

Each read occupies four lines: Header, Sequence, Header (repeated after "+"), Quality (ASCII encoded).

How many sequence reads are in my file?

    wc -l <path/to/my/file>

    $ wc -l GKG-13.fastq
    25703828 GKG-13.fastq

25 703 828 lines / 4 lines per read = 6 425 957 sequence reads

    $ fastq_complexity.py GKG-13.fastq
    6 425 957 reads
    550 706 distinct sequences
    0.085700 complexity

fastq_complexity.py:

    #!/usr/bin/python
    import sys

    readDic = {}      # sequence -> number of occurrences
    Nbre_reads = 0
    Nbre_lines = 0
    F = open(sys.argv[1])
    for line in F:
        Nbre_lines += 1
        if Nbre_lines % 4 == 2:   # second line of each 4-line record = sequence
            Nbre_reads += 1
            readDic[line] = readDic.get(line, 0) + 1
    F.close()
    # print() with a single argument runs under both Python 2 and Python 3
    print("%s reads" % Nbre_reads)
    print("%s distinct sequences" % (len(readDic)))
    print("%f complexity" % (len(readDic)/float(Nbre_reads)))

Do my sequence reads contain the adapter?
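One way to put this question to the file in Python rather than in the shell is sketched below. It is only an illustration, assuming a FASTQ file with exactly four lines per record and searching for CTGTAGG, the first seven bases of the 3' adapter CTGTAGGCACCATCAAT used in this tutorial. The function name `count_adapter_reads`, the shortened headers and the third demo record are hypothetical.

```python
import io

ADAPTER_PREFIX = "CTGTAGG"  # first 7 nt of the 3' adapter CTGTAGGCACCATCAAT


def count_adapter_reads(handle, motif=ADAPTER_PREFIX):
    """Return (total_reads, reads_with_motif) for a 4-line-per-record FASTQ."""
    total = 0
    with_motif = 0
    for line_number, line in enumerate(handle, start=1):
        if line_number % 4 == 2:  # second line of each record is the sequence
            total += 1
            if motif in line:
                with_motif += 1
    return total, with_motif


# Demo data: two sequences taken from GKG-13.fastq (headers shortened,
# hypothetical) and one invented read that lacks the adapter.
DEMO_FASTQ = (
    "@read1\n"
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n"
    "+read1\n"
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n"
    "@read2\n"
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC\n"
    "+read2\n"
    "aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^\n"
    "@read3\n"
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGTAC\n"
    "+read3\n"
    "hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh\n"
)

if __name__ == "__main__":
    total, hits = count_adapter_reads(io.StringIO(DEMO_FASTQ))
    print("%d reads, %d containing %s" % (total, hits, ADAPTER_PREFIX))
```

Unlike a plain grep over the whole file, this sketch looks only at sequence lines, so a motif that happens to appear in a header or quality line is not counted. For a real file, pass `open("GKG-13.fastq")` instead of the in-memory demo.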
My 3' adapter: CTGTAGGCACCATCAAT

    cat <path/file> | grep CTGTAGG | wc -l
    grep -c "CTGTAGG" <path/file>

    lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep CTGTAGG | wc -l
    6355061
    lbcd-05:GKG13demo deepseq$ grep -c "CTGTAGG" GKG-13.fastq
    6355061

6 355 061 out of 6 425 957 sequences … not bad (98.9%)

Conversely, an unrelated 7-nt motif is rare:

    lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep ATCTCGT | wc -l
    308

Take home message n°1
Unix operating systems already contain powerful native tools for text analysis.

Regular expression

    $ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l

• cat outputs the content of a file, line by line
• | passes the output of one command to the input of the next command
• perl -ne calls the perl interpreter with the -ne options (loop & execute)
• 'print if /^[ATGCN]{22}CTGTAGG/' is the inline perl code
• wc with the -l option counts the lines

Quality Control. Can I trust my sequences?
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Demo with the GUI version

Quality Control. Can I trust my sequences?
fastQC in GALAXY

How to clip the adapter?
http://hannonlab.cshl.edu/fastx_toolkit/index.html
3' adapter: CTGTAGGCACCATCAAT

    fastq_to_fasta -r -n -i GKG-13.fastq | fastx_clipper -a CTGTAGGCACCATCAAT -l 18 -o GKG-13_clip-pipe.fasta

Clipping in GALAXY

http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed genomes
• Download and install Bowtie, and read the manual.
• Download your genome (FASTA format)
• Build the Bowtie index using bowtie-build

    deepseq$ bowtie-build fasta_libraries/dmel-all-chromosome-r5.37.fasta dmel-r5.37

~nn min (but indexed references are available)

    deepseq$ ls -laht
    -rw-r--r-- 1 deepseq staff  49M Mar 24 17:24 dmel-r5.37.rev.1.ebwt
    -rw-r--r-- 1 deepseq staff  19M Mar 24 17:24 dmel-r5.37.rev.2.ebwt
    -rw-r--r-- 1 deepseq staff  49M Mar 24 17:20 dmel-r5.37.1.ebwt
    -rw-r--r-- 1 deepseq staff  19M Mar 24 17:20 dmel-r5.37.2.ebwt
    -rw-r--r-- 1 deepseq staff 331K Mar 24 17:16 dmel-r5.37.3.ebwt
    -rw-r--r-- 1 deepseq staff  39M Mar 24 17:16 dmel-r5.37.4.ebwt

A bowtie alignment (Demo on Mac)

    deepseq$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated
    # reads processed: 5997502
    # reads with at least one reported alignment: 5045151 (84.12%)
    # reads that failed to align: 952351 (15.88%)
    Reported 5045151 alignments to 1 output stream(s)

Bowtie outputs

    deepseq$ ls -laht
    -rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
    -rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
    -rw-r--r-- 1 deepseq staff  28M Mar 24 17:46 unmatched_GKG13.fa

Tabular alignment report
Columns shown: read name, strand, reference sequence, 0-based offset, read sequence. For "-" strand hits, Bowtie reports the reverse complement of the read.

    deepseq$ more GKG13_bowtie_output.tabulated
    21  +  2L     20487495  TGGAATGTAAAGAAGTATGGAG
    30  -  3L     15836559  GTGAATTCTCCCAGTGCCAAG
    25  +  3R     5916902   TGAACACAGCTGGTGGTATCC
    23  -  2L     11953462  CCCGTGAATTCTTCCAGTGCCATT
    27  +  3R     5916902   TGAACACAGCTGGTGGTATC
    26  -  3R     9289997   TCCTGCGGCACTAGTACTTA
    18  -  2L     11953465  GTGAATTCTTCCAGTGCCATT
    22  -  3R     8377246   ATTGCTGGAATCAAGTTGCTGAC
    20  +  3L     11650036  TTTGTGACCGACACTAACGGGTA
    24  +  2R     16493585  TGGAAGACTAGTGATTTTGTT
    28  +  3L     10358380  TAGGAACTTCATACCGTGCTCT
    35  +  X      18022302  CTTGTGCGTGTGACAGCGGCT
    41  -  3RHet  138608    TGGCGACCGTGACAGGACCCG
    42  +  3R     5916902   TGAACACAGCTGGTGGTATCC

Aligned sequences

    deepseq$ more droso_matched_GKG-13.fa
    >21
    TGGAATGTAAAGAAGTATGGAG
    >26
    TAAGTACTAGTGCCGCAGGA
    >24
    TGGAAGACTAGTGATTTTGTT
    >23
    AATGGCACTGGAAGAATTCACGGG
    >27
    TGAACACAGCTGGTGGTATC

Unaligned sequences

    deepseq$ more unmatched_GKG13.fa
    >29
    AGGGGGCTATTTCACTACTGGA
    >33
    CGATGATGACGGTACCCGTAGA
    >37
    GCTAGTCGGTACTTGAAAC
    >59
    TGGTTGCAATAGCTTCTGGCGGA
    >61
    GATGAGTGCTAGATGTAGGGA

A pipeline for small RNA annotation (see in GED Galaxy)
Hierarchical annotation of sequence datasets: the sequence reads (fasta format) are aligned with Bowtie against a first reference. Matched reads (fasta) are counted, and unmatched reads are passed on to the next Bowtie step. References, in order:
• Pre-miRNAs (miRBase)
• Non coding RNAs
• Transposons
• Genes
• Viruses, transgenes, etc…
• Intergenic regions
What is left at the end: remaining unmatched sequences.

samtools
http://samtools.sourceforge.net/
Sam format
Bam format (for Genome Browsers)
• Sorted
• Indexed
• Compressed

Preparation of a BAM file and its associated index

    $ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -M 1 --best -p 2 -S | samtools view -bS -o GKG-13_clip-pipe.fasta.bam - ; samtools sort GKG-13_clip-pipe.fasta.bam GKG-13_clip-pipe.fasta.bam.sorted ; samtools index GKG-13_clip-pipe.fasta.bam.sorted.bam

    306K GKG-13_clip-pipe.fasta.bam.sorted.bam.bai
     42M GKG-13_clip-pipe.fasta.bam.sorted.bam
     80M GKG-13_clip-pipe.fasta.bam

~3 min

Read visualization in a Genome Browser
• Upload of the BAM file to a remote server (Amazon cloud)
• Passing the URL to Ensembl (GBrowse, modENCODE, etc.)

Naive and primed murine pluripotent stem cells have distinct miRNA signatures
A. Jouneau (INRA Jouy en Josas), E. Heard (Institut Curie), C. Antoniewski (Institut Pasteur), M.
Cohen-Tannoudji (Institut Pasteur)
Samples (in figure order): ESC1, ESC2, EpiSC3, EpiSC1, EpiSC2

miRNA profiling

    deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt

• Sequence reads (fasta format)
• Bowtie against Pre-miRNAs (miRBase)
• Bowtie output parsing
• Read maps for all miRNAs; hit list for miR_5p and miR_3p
• miR profiling / hit list aggregation
• Differential calling

http://www.r-project.org/
DESeq, Heatplus (Bioconductor)

touRism
Load DESeq, gplots and RColorBrewer

    library(DESeq)
    library(gplots)
    library(RColorBrewer)

    # Read the hit table; the first column ("gene") becomes the row names
    countsTable <- read.delim("~/Documents/Pasteur_DEMO/mouse_hits.txt", header=TRUE, stringsAsFactors=TRUE)
    head(countsTable)
    rownames(countsTable) <- countsTable$gene
    countsTable <- countsTable[ , -1]
    head(countsTable)
    summary(countsTable)
    plot(countsTable)
    plot(log(countsTable, 10))

    # One condition label per library (column)
    conds <- c("EPI", "EPI", "EPI", "ES", "ES")
    cds <- newCountDataSet(countsTable, conds)
    cds <- estimateSizeFactors(cds)
    sizeFactors(cds)

    # Sample-to-sample distances on variance-stabilized data
    cds <- estimateDispersions(cds, method="pooled")
    vsd <- getVarianceStabilizedData(cds)
    dists <- dist(t(vsd))
    heatmap(as.matrix(dists), symm=TRUE, margins=c(12,12), cexRow=1, cexCol=1)

    # Keep the 100 most variable rows (column 6 = SampleVar, after the 5 samples)
    SampleVar <- apply(vsd, 1, var)
    vsd2 <- cbind(vsd, SampleVar)
    vsd3 <- vsd2[order(vsd2[,6], decreasing=TRUE), ]
    head(vsd3)
    vsd3 <- head(vsd3, 100)
    vsd3 <- vsd3[ , -6]
    head(vsd3)
    heatmap.2(vsd3, col=brewer.pal(11, "RdBu"), scale="none", trace="none", margins=c(3,45), cexRow=0.7, cexCol=1, dendrogram="column", density.info="none", keysize=0.7)

    # Differential calling; column 8 of the result is the adjusted p-value (padj)
    cds <- estimateDispersions(cds, method="per-condition", sharingMode="fit-only")
    res <- nbinomTest(cds, "EPI", "ES")
    resNA <- res[!is.na(res[,8]), ]
    resNA[order(resNA[,8]), ]

Deep Seq Data Analysis, Final Take Home Messages
• Think about your deep seq replicates from the start
• Keep control of your data, from the « fastq stage » onward
• Keep control of the analysis
  because this is your project
• Always keep an eye on « Normalization » and « Differential »
• Don't be afraid of bioinformatics, but don't reinvent the wheel
• It's open source, open manual
• It's not magic, yes you can
• It's fun
• You cannot escape, so take it easy.
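
The « Normalization » message can be made concrete with the idea behind DESeq's estimateSizeFactors: the size factor of a library is the median, over genes, of its count divided by that gene's geometric mean across all libraries. What follows is a minimal pure-Python sketch of this median-of-ratios idea, not the DESeq code itself. The function name `size_factors`, the sample names and the toy counts are invented for illustration, and genes with a zero count in any library are simply skipped.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts,
            all lists in the same gene order.
    Returns a dict mapping sample name -> size factor.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if min(values) <= 0:
            continue  # geometric mean would be zero: skip this gene
        log_geo_mean = sum(math.log(v) for v in values) / len(values)
        for s, v in zip(samples, values):
            log_ratios[s].append(math.log(v) - log_geo_mean)
    return {s: math.exp(median(log_ratios[s])) for s in samples}


# Toy data (invented): the second library was sequenced twice as deeply.
toy_counts = {
    "ES1": [10, 20, 30, 0],
    "EPI1": [20, 40, 60, 5],
}
factors = size_factors(toy_counts)
normalized = {s: [c / factors[s] for c in toy_counts[s]] for s in toy_counts}

if __name__ == "__main__":
    for sample in toy_counts:
        print(sample, round(factors[sample], 4), normalized[sample])
```

In this toy case the two size factors differ by a factor of two, so after division the first three genes have identical normalized counts in both libraries: the difference in sequencing depth is removed before any differential calling.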