Sequencing Technologies

Mouse Genetics
January 24, 2013, 14:00–16:30
(A tool kit for) Deep Seq Data Analysis
Christophe.antoniewski@upmc.fr
http://drosophile.org
Why deep seq analyses ?
Your project involves:









Mutation and SNP identification or analysis (genome re-sequencing)
Gene/Disease Linkage (genome re-sequencing)
Pathogen identification (de novo sequence assembly or re-sequencing)
Transcriptome analysis (RNAseq)
DNA methylation study (MeDIP-seq)
Chromatin study (ChIPseq)
Transcription factor study (ChIPseq)
miRNAs, siRNA, piRNA, tRF, etc... (small RNA seq)
Single cell transcriptome analysis
Qualitative information
Deep seq
Quantitative information
Sequencing Technologies
Sequencing Technologies : Quantitative Facts
From
Sequencing Technologies : Focus on Illumina technology
« Library »
« Clusters »
Cluster
Sequences
For mRNA seq, non Directional
small RNAseq library preparation
(Directional)
(Biases)
20-30nt RNA gel purification
Library “Bar coding”
ChIPseq library preparation
(Non Directional)
What can I do with my sequence reads ?

Locus discovery/mutation discovery/Splicing annotation
→ Annotation & visualization

Read quantitative profiling (Transcriptome, chromatin profiling, etc.)
→ Statistics

Structure analysis of precursors, signatures…
→ Maths & Statistics

…
Flowchart of a sequencing project
• What am I going to sequence ? Whole genome, Whole exome, Target enrichment
• Platform Selection: Inherent biases, Specific benefits (Read length, single or paired ends, number of reads)
• Library Preparation: Size selection, Amplification, Single Cell Protocol
• Sequencing: Number of Cycles, Number of lanes
• Quality Control: Adapter Clipping, Quality trimming, Contaminant and Error identification
• Alignment: Bowtie, BWA…… (P Flicek & E Birney, Nature Methods 2009)
• Assembly: Velvet, SSAKE…… (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
• Visualization & Statistics:
  • Normalization (library comparison)
  • Peak finding (Binding sites, Breakpoints, etc…)
  • Differential Calling (expression, variants, etc)
Think about the number of replicates before starting
R & Open Source software tools
A case study: miRNAs (and other small RNAs)
Hen1
met
+
snoRNA, tRNA, rRNA fragments
small RNAseq library preparation
(Directional)
(Biases)
20-30nt RNA gel purification
Library “Bar coding”
Basic Material

A sequence file (fastq format)

A computer with enough RAM (8 Gigabytes is a good start)

A Unix compliant Operating System + a bit of « basic know how »

A couple of very useful software tools with a Graphical User Interface (GUI)

TextWrangler, an advanced text editor with RegEx integration

R (for statistics and, more importantly, Graphics)

…

GALAXY is an (our) option

Knowledge of at least one programming language

Python, Perl, Java, C++…
What does this big* fastq file contain ?
* Size limit to open a text file with a text editor (~1.2 Gb)
→ Unix Terminal
→ more <path/to/the/file>
$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^\ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB\^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
…
…
…
Each fastq record has four lines:
• Header (starts with @)
• Sequence
• Header (repeated, starts with +)
• Quality (ascii encoded)
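The four-line record layout above can be read programmatically. The following is a minimal Python sketch, not the presenter's code: the function names `read_fastq` and `decode_quality` are hypothetical, and the quality offset of 64 is an assumption (it fits older Illumina pipelines; Sanger-style files use 33).

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of fastq lines."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(iterator, 4)]
        if len(record) < 4:
            return  # end of input (or truncated last record)
        header, sequence, _plus_line, quality = record
        yield header[1:], sequence, quality  # drop the leading "@"


def decode_quality(quality, offset=64):
    """Convert an ascii-encoded quality string into integer scores.

    offset=64 is an ASSUMPTION; check which encoding your sequencer used.
    """
    return [ord(char) - offset for char in quality]


# First record of the excerpt shown above.
example = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n",
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n",
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n",
]

for header, sequence, quality in read_fastq(example):
    print(header, sequence, decode_quality(quality)[:5])
```

Reading four lines at a time keeps memory use flat, which matters for files too big to open in a text editor.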
How many sequence reads in my file ?
→ wc -l <path/to/my/file>
$ wc -l GKG-13.fastq
25703828 GKG-13.fastq
>>> 25 703 828 / 4
6 425 957 sequence reads
$ fastq_complexity.py GKG-13.fastq
6 425 957 reads
550 706 distinct sequences
0.085700 complexity
fastq_complexity.py:
#!/usr/bin/python
import sys

readDic = {}      # sequence -> number of occurrences
Nbre_reads = 0
Nbre_lines = 0
F = open(sys.argv[1])
for line in F:
    Nbre_lines += 1
    # the sequence is the 2nd line of each 4-line fastq record
    if Nbre_lines % 4 == 2:
        Nbre_reads += 1
        readDic[line] = readDic.get(line, 0) + 1
F.close()
# print(...) with a single argument runs under both Python 2 and Python 3
print("%s reads" % Nbre_reads)
print("%s distinct sequences" % len(readDic))
print("%f complexity" % (len(readDic) / float(Nbre_reads)))
Do my sequence reads contain the adapter ?
My 3’ adapter: CTGTAGGCACCATCAAT
→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>
lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
lbcd-05:GKG13demo deepseq$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of
6 425 957 sequences
… not bad (98.9%)
Conversely, with an unrelated motif:
lbcd-05:GKG13demo deepseq$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308
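The same count can be sketched in Python. This is an illustration under stated assumptions: `adapter_fraction` is a hypothetical helper, the third toy read is made up, and only sequence lines are inspected (`grep -c` counts any matching line, including headers or quality lines).

```python
def adapter_fraction(sequences, motif="CTGTAGG"):
    """Return (reads with motif, total reads, percentage with motif)."""
    total = 0
    with_motif = 0
    for sequence in sequences:
        total += 1
        if motif in sequence:
            with_motif += 1
    percent = 100.0 * with_motif / total if total else 0.0
    return with_motif, total, percent


toy_reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # from the fastq excerpt
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # from the fastq excerpt
    "AGGGGGCTATTTCACTACTGGAAAAAAAAAAAAAAAAA",  # made-up read without adapter
]
print(adapter_fraction(toy_reads))

# Percentage for the counts reported by grep and wc on GKG-13.fastq.
print("%.2f%%" % (100.0 * 6355061 / 6425957))
```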
Take home message n°1
Unix Operating Systems already contain powerful
native tools for text analysis
Regular expression
$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
• cat outputs the content of a file, line by line. The output is passed to the input of the next command.
• The perl interpreter is called with -ne options (loop & execute) to run the in-line perl code. The output is passed to the input of the next command.
• wc with -l option counts the lines.
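For readers more at ease with Python than Perl, here is a minimal sketch of the same filter with the `re` module. It uses the pattern from the one-liner above; the function name `count_matching` and the test lines are illustrative only.

```python
import re

# Same pattern as in the perl one-liner: exactly 22 nucleotides from the
# start of the line, immediately followed by the start of the 3' adapter.
PATTERN = re.compile(r"^[ATGCN]{22}CTGTAGG")


def count_matching(lines):
    """Count the lines that match the pattern (equivalent of `| wc -l`)."""
    return sum(1 for line in lines if PATTERN.search(line))


lines = [
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1",  # header line: no match
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",   # 22 nt insert: match
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",   # 21 nt insert: no match
]
print(count_matching(lines))
```

Changing `{22}` to another length, or to a range such as `{20,24}`, selects inserts of other sizes.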
Quality Control. Can I trust my sequences ?
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Demo with the GUI version
Quality Control. Can I trust my sequences ?
fastQC in GALAXY
How to clip the adapter ?
http://hannonlab.cshl.edu/fastx_toolkit/index.html
3’ adapter: CTGTAGGCACCATCAAT
fastq_to_fasta -r -n -i GKG-13.fastq | fastx_clipper -a CTGTAGGCACCATCAAT -l 18 -o GKG-13_clip-pipe.fasta
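To make the clipping step concrete, here is a simplified Python sketch. It is not how fastx_clipper works internally: it uses exact string matching, whereas the real tool is more tolerant. The function name `clip_adapter` and the `min_overlap` parameter are assumptions introduced for illustration; `min_length=18` mirrors the `-l 18` option above.

```python
ADAPTER = "CTGTAGGCACCATCAAT"


def clip_adapter(sequence, adapter=ADAPTER, min_length=18, min_overlap=6):
    """Return the insert upstream of the adapter, or None if discarded.

    Simplification: exact matching only. A read is discarded when the
    clipped insert is shorter than min_length or contains an N.
    """
    position = sequence.find(adapter)
    if position == -1:
        # The adapter may be truncated at the 3' end of the read: look for
        # the longest adapter prefix that ends the read.
        for size in range(len(adapter) - 1, min_overlap - 1, -1):
            if sequence.endswith(adapter[:size]):
                position = len(sequence) - size
                break
    insert = sequence if position == -1 else sequence[:position]
    if len(insert) < min_length or "N" in insert:
        return None
    return insert


print(clip_adapter("TAGGAACTTCATACCGTGCTCT" + ADAPTER))
print(clip_adapter("TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA"))
```

The second call returns None because the clipped insert still contains an N.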
Clipping in GALAXY
http://bowtie-bio.sourceforge.net/
Bowtie aligns reads on indexed
genomes
• Download and install Bowtie, and read the manual.
• Download your genome (format FASTA)
• Build the Bowtie index using bowtie-build
deepseq$ bowtie-build fasta_libraries/dmel-all-chromosome-r5.37.fasta dmel-r5.37
~nn min (but indexed references available)
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:24 dmel-r5.37.rev.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:24 dmel-r5.37.rev.2.ebwt
-rw-r--r-- 1 deepseq staff 49M Mar 24 17:20 dmel-r5.37.1.ebwt
-rw-r--r-- 1 deepseq staff 19M Mar 24 17:20 dmel-r5.37.2.ebwt
-rw-r--r-- 1 deepseq staff 331K Mar 24 17:16 dmel-r5.37.3.ebwt
-rw-r--r-- 1 deepseq staff 39M Mar 24 17:16 dmel-r5.37.4.ebwt
A bowtie alignment (Demo on Mac)
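deepseq$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -k 1 -p 2 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa > GKG13_bowtie_output.tabulated
• ~/bin/bowtie-0.12.7/indexes/dmel : basename of the Bowtie index
• -f GKG-13_clip-pipe.fasta : the reads are in a fasta file
• -v 1 : at most 1 mismatch per alignment
• -k 1 : report at most 1 alignment per read
• -p 2 : use 2 processors
• --al droso_matched_GKG-13.fa : write the aligned reads to this file
• --un unmatched_GKG13.fa : write the unaligned reads to this file
• > GKG13_bowtie_output.tabulated : redirect the alignment report to this file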
# reads processed: 5997502
# reads with at least one reported alignment: 5045151 (84.12%)
# reads that failed to align: 952351 (15.88%)
Reported 5045151 alignments to 1 output stream(s)
Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
Tabular alignment report
deepseq$ more GKG13_bowtie_output.tabulated
21  +  2L     20487495  TGGAATGTAAAGAAGTATGGAG
30  -  3L     15836559  GTGAATTCTCCCAGTGCCAAG
25  +  3R     5916902   TGAACACAGCTGGTGGTATCC
23  -  2L     11953462  CCCGTGAATTCTTCCAGTGCCATT
27  +  3R     5916902   TGAACACAGCTGGTGGTATC
26  -  3R     9289997   TCCTGCGGCACTAGTACTTA
18  -  2L     11953465  GTGAATTCTTCCAGTGCCATT
22  -  3R     8377246   ATTGCTGGAATCAAGTTGCTGAC
20  +  3L     11650036  TTTGTGACCGACACTAACGGGTA
24  +  2R     16493585  TGGAAGACTAGTGATTTTGTT
28  +  3L     10358380  TAGGAACTTCATACCGTGCTCT
35  +  X      18022302  CTTGTGCGTGTGACAGCGGCT
41  -  3RHet  138608    TGGCGACCGTGACAGGACCCG
42  +  3R     5916902   TGAACACAGCTGGTGGTATCC
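A tabular report like this one is easy to parse. The sketch below assumes Bowtie's default tab-delimited output, whose first five fields are read name, strand, reference name, 0-based offset and sequence. The function names are hypothetical and the three example lines are taken from the report above.

```python
from collections import Counter


def parse_bowtie_line(line):
    """Split one line of the default Bowtie output into named fields."""
    fields = line.rstrip("\n").split("\t")
    return {
        "read": fields[0],
        "strand": fields[1],
        "reference": fields[2],
        "offset": int(fields[3]),  # 0-based position on the reference
        "sequence": fields[4],
    }


def hits_per_reference(lines):
    """Count the reported alignments for each reference sequence."""
    return Counter(
        parse_bowtie_line(line)["reference"] for line in lines if line.strip()
    )


report = [
    "21\t+\t2L\t20487495\tTGGAATGTAAAGAAGTATGGAG\n",
    "25\t+\t3R\t5916902\tTGAACACAGCTGGTGGTATCC\n",
    "27\t+\t3R\t5916902\tTGAACACAGCTGGTGGTATC\n",
]
print(hits_per_reference(report).most_common())
```

On a real file, passing the open file handle instead of the list gives the read count per chromosome arm in one pass.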
Aligned sequences
deepseq$ more droso_matched_GKG-13.fa
>21
TGGAATGTAAAGAAGTATGGAG
>26
TAAGTACTAGTGCCGCAGGA
>24
TGGAAGACTAGTGATTTTGTT
>23
AATGGCACTGGAAGAATTCACGGG
>27
TGAACACAGCTGGTGGTATC
Unaligned sequences
deepseq$ more unmatched_GKG13.fa
>29
AGGGGGCTATTTCACTACTGGA
>33
CGATGATGACGGTACCCGTAGA
>37
GCTAGTCGGTACTTGAAAC
>59
TGGTTGCAATAGCTTCTGGCGGA
>61
GATGAGTGCTAGATGTAGGGA
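Note that read 26 appears as TAAGTACTAGTGCCGCAGGA in the matched fasta file but as TCCTGCGGCACTAGTACTTA in the tabular report. This is consistent with an alignment on the minus strand, which Bowtie reports as the reverse complement of the read. A minimal Python sketch to check it (the helper name `reverse_complement` is illustrative):

```python
# Translation table for the complement of each nucleotide (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


# Reads 26 and 23 as they appear in droso_matched_GKG-13.fa
print(reverse_complement("TAAGTACTAGTGCCGCAGGA"))
print(reverse_complement("AATGGCACTGGAAGAATTCACGGG"))
```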
A pipeline for small RNA annotation (see in GED Galaxy)
Hierarchical annotation of sequence datasets
Sequence reads (fasta format)
• Bowtie on Pre-miRNAs (miRBase) → Matched reads (fasta), Read Count
• Unmatched reads → Bowtie on Non coding RNAs → Matched reads (fasta), Read Count
• Unmatched reads → Bowtie on Transposons → Matched reads (fasta), Read Count
• Unmatched reads → Bowtie on Genes → Matched reads (fasta), Read Count
• Unmatched reads → Bowtie on Viruses, transgenes, etc… → Matched reads (fasta), Read Count
• Unmatched reads → Bowtie on Intergenic regions → Matched reads (fasta), Read Count
• Unmatched reads → Remaining unmatched sequences
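The cascade logic of this pipeline can be sketched in a few lines of Python. This is an illustration, not the Galaxy workflow: the Bowtie step is replaced by a toy exact-substring matcher, and the reads and reference sequences are invented.

```python
def hierarchical_annotation(reads, reference_classes, matcher):
    """Annotate reads against an ordered list of reference classes.

    Each read is counted in the first class it matches; only unmatched
    reads are passed on to the next class.
    """
    counts = {}
    remaining = list(reads)
    for class_name, reference in reference_classes:
        matched, unmatched = [], []
        for read in remaining:
            (matched if matcher(read, reference) else unmatched).append(read)
        counts[class_name] = len(matched)
        remaining = unmatched
    counts["unmatched"] = len(remaining)
    return counts, remaining


def toy_matcher(read, reference_sequences):
    """Stand-in for Bowtie (ASSUMPTION): exact substring match only."""
    return any(read in sequence for sequence in reference_sequences)


# Invented toy data, for illustration only.
toy_reads = ["AAAA", "CCCC", "GGGG", "TTTT"]
toy_classes = [
    ("pre-miRNAs", ["TTAAAATT"]),
    ("transposons", ["GGGGCCCC"]),
]
print(hierarchical_annotation(toy_reads, toy_classes, toy_matcher))
```

The order of the classes matters: a read that could match both a pre-miRNA and a transposon is only ever counted as a pre-miRNA hit.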
samtools
http://samtools.sourceforge.net/
Sam format
Bam format (for Genome Browsers)
• Sorted
• Indexed
• Compressed
Preparation of a BAM file and its associated index
$ bowtie ~/bin/bowtie-0.12.7/indexes/dmel -f GKG-13_clip-pipe.fasta -v 1 -M 1 --best -p 2 -S | samtools view -bS -o GKG-13_clip-pipe.fasta.bam - ; samtools sort GKG-13_clip-pipe.fasta.bam GKG-13_clip-pipe.fasta.bam.sorted ; samtools index GKG-13_clip-pipe.fasta.bam.sorted.bam

306K GKG-13_clip-pipe.fasta.bam.sorted.bam.bai
42M GKG-13_clip-pipe.fasta.bam.sorted.bam
80M GKG-13_clip-pipe.fasta.bam
~3 min
Read visualization in a Genome Browser
Upload of BAM file to a remote server (amazon cloud)
Passing the URL to Ensembl (Gbrowse, Modencode, etc..)
Naive and primed murine pluripotent stem cells have distinct miRNA signatures
A. Jouneau (INRA Jouy en Josas)
E. Heard (Institut Curie)
C. Antoniewski (Institut Pasteur)
M. Cohen-Tannoudji (Institut Pasteur)
ESC1 ESC2 EpiSC3 EpiSC1 EpiSC2
miRNA profiling
deepseq$ miRNA_bowtie_profiler.py GKG-13_clip-pipe.fasta ~/bin/bowtie/indexes/dme_miR_r17.1.ebwt
Sequence reads (fasta format)
Bowtie
Pre-miRNAs (miRBase)
Bowtie Output
Parsing


Read maps for all miRNAs
Hit list for miR_5p and miR_3p
miR profiling / hit list aggregation
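Aggregation means merging the per-library hit lists into one count table with one row per miRNA and one column per library. The sketch below is one possible way to do it, not the presenter's script: `aggregate_hit_lists` is a hypothetical function, the counts are invented, and the first column is named "gene" only because a tab-delimited table with that column name is convenient to load into R.

```python
def aggregate_hit_lists(hit_lists):
    """Merge per-library hit lists into a single count table.

    hit_lists: list of (library_name, {miRNA_name: read_count}) pairs.
    Returns (header, rows); miRNAs absent from a library get a count of 0.
    """
    libraries = [name for name, _counts in hit_lists]
    mirnas = sorted({mir for _name, counts in hit_lists for mir in counts})
    header = ["gene"] + libraries
    rows = [
        [mir] + [counts.get(mir, 0) for _name, counts in hit_lists]
        for mir in mirnas
    ]
    return header, rows


# Invented counts, for illustration only.
toy_hit_lists = [
    ("ESC1", {"miR-A_5p": 120, "miR-B_3p": 4}),
    ("EpiSC1", {"miR-A_5p": 15, "miR-C_5p": 60}),
]
header, rows = aggregate_hit_lists(toy_hit_lists)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```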
Differential Calling
http://www.r-project.org/
DESeq
Heatplus
(Bioconductor)
touRism
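Load DESeq, gplots and RColorBrewer
library(DESeq)
library(gplots)
library(RColorBrewer)
# Read the hit table: one row per miRNA, one column per library
countsTable <- read.delim( "~/Documents/Pasteur_DEMO/mouse_hits.txt", header=TRUE, stringsAsFactors=TRUE )
head(countsTable)
# Use the gene column as row names, then drop it
rownames(countsTable) <- countsTable$gene
countsTable <- countsTable[ , -1 ]
head(countsTable)
summary(countsTable)
plot(countsTable)
plot(log(countsTable,10))
# Experimental condition of each column, in column order
conds <- c( "EPI", "EPI", "EPI", "ES", "ES" )
cds <- newCountDataSet( countsTable, conds )
# Normalization: one size factor per library
cds <- estimateSizeFactors( cds )
sizeFactors( cds )
cds <- estimateDispersions( cds, method="pooled" )
# Sample-to-sample distances on variance-stabilized data
vsd <- getVarianceStabilizedData( cds )
dists <- dist( t( vsd ) )
heatmap( as.matrix( dists ), symm=TRUE, margins=c(12,12), cexRow=1, cexCol=1 )
# Keep the 100 most variable miRNAs (column 6 holds the variance)
SampleVar <- apply(vsd, 1, var)
vsd2 <- cbind(vsd, SampleVar)
vsd3 <- vsd2[order(vsd2[,6], decreasing=TRUE),]
head(vsd3)
vsd3 <- head(vsd3, 100)
vsd3 <- vsd3[,-6]
head(vsd3)
heatmap.2(vsd3, col=brewer.pal(11, "RdBu"), scale="none", trace="none", margins=c(3,45), cexRow=0.7, cexCol=1, dendrogram="column", density.info="none", keysize=0.7)
# Differential calling between the two conditions
cds <- estimateDispersions( cds, method="per-condition", sharingMode="fit-only" )
res <- nbinomTest( cds, "EPI", "ES" )
# Drop rows without an adjusted p-value (column 8, padj), then sort by it;
# logical indexing also works when there is no NA at all
resNA <- res[!is.na(res$padj), ]
resNA[order(resNA$padj), ]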
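To see what the normalization step does, here is a pure-Python sketch of the median-of-ratios idea behind estimateSizeFactors: each library is compared with a pseudo-reference made of the per-gene geometric means, and its size factor is the median ratio. This is an illustration on invented counts, not the DESeq code; the function name `size_factors` is hypothetical.

```python
import math
from statistics import median


def size_factors(count_table):
    """Median-of-ratios size factors.

    count_table: list of rows (one per gene), each a list of raw counts
    (one per library). Genes with a zero count in any library are skipped
    because their geometric mean is zero.
    """
    n_libraries = len(count_table[0])
    log_ratios = [[] for _ in range(n_libraries)]
    for row in count_table:
        if any(count == 0 for count in row):
            continue
        log_geometric_mean = sum(math.log(count) for count in row) / n_libraries
        for index, count in enumerate(row):
            log_ratios[index].append(math.log(count) - log_geometric_mean)
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Invented counts: the second library is sequenced twice as deep.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped: zero count in the first library
]
print(size_factors(toy_counts))
```

Dividing each column by its size factor makes the libraries comparable. Using the median rather than the total read count keeps a few very abundant miRNAs from dominating the normalization.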
Deep Seq Data Analysis, Final Take Home Messages

Think about your deep seq replicates from the start

Keep control of your data, from the « fastq stage » onward

Stay in charge of the analysis, because this is your project

Always keep an eye on « Normalization » and « Differential »

Don’t be afraid of bioinformatics, but don’t reinvent the wheel

It’s open source, open manual

It’s not magic, yes you can

It’s fun

You cannot escape, so take it easy.