ChIPSeq Bioinformatic Pipeline

November 16th, 2015
Di Rienzo lab meeting
Introduction of the ChIP-seq pipeline
Shigeki Nakagome
ChIP-seq
Step 1: Chromatin immunoprecipitation (IP)
▸ A target protein (e.g. VDR) binds to
DNA in an open chromatin region
▸ Sonicate open chromatin regions
▸ Capture VDR-bound DNA fragments
with a VDR antibody
▸ IPed DNA fragments are enriched
for genomic regions bound by VDR
Step 2: Next generation sequencing
▸ Make a library from the IPed DNA
and sequence it on an Illumina platform
(single-end reads; 50 bp)
Park (2009) Nat Rev Genet
A workflow of processing sequenced data
Leung et al. (2015) Nature
1. Mapping sequence reads
2. Checking the quality of IP
▸ Identifying TF binding sites
    3. Calling peaks (using MACS2)
    4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)
▸ Identifying SNPs associated with allelic imbalance
    3. Correct mapping bias (using WASP)
    4. Calling genotypes and testing allelic imbalance (using QuASAR)
1. Mapping sequence reads
i) Map sequence reads to the human reference genome using BWA
▸ Aligning sequence reads:
bwa aln -n 2 -o 0 REFERENCE.fa SEQUENCED.fastq > SEQUENCED.fastq.sai
▸ Generating a SAM format file:
bwa samse REFERENCE.fa SEQUENCED.fastq.sai SEQUENCED.fastq >
SEQUENCED.fastq.sam
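ii) Choose uniquely mapped reads based on the BWA tag and the mapping quality
▸ Extracting uniquely mapped sequence reads based on the tag, “XT:A:U” (header lines starting with “@” are kept so that samtools can still read the file):
grep -E "^@|XT:A:U" SEQUENCED.fastq.sam > SEQUENCED.fastq.sam.tmp
▸ Filtering sequence reads with a mapping quality ≥ 30 (-q 30) and dropping unmapped reads (-F 4):
samtools view -bhS -q 30 -F 4 -o SEQUENCED.fastq.bam SEQUENCED.fastq.sam.tmp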
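The two filters in step ii) can be tried on a toy SAM file without BWA or samtools installed. The following is only a sketch: `filter_unique` is a hypothetical helper name, the three reads are made up, and the logic simply mirrors the tag, MAPQ, and unmapped-flag criteria described above.

```shell
#!/bin/sh
# filter_unique SAM_FILE   (hypothetical helper, for illustration only)
# Keep header lines, plus alignments that are mapped (flag bit 0x4 unset),
# carry the BWA tag XT:A:U, and have MAPQ >= 30 (the same cutoff as -q 30).
filter_unique() {
    awk -F '\t' '
        /^@/ { print; next }                 # header lines pass through
        {
            unmapped = int($2 / 4) % 2       # flag bit 0x4 = read unmapped
            unique = 0
            for (i = 12; i <= NF; i++)       # optional tags start at column 12
                if ($i == "XT:A:U") unique = 1
            if (!unmapped && unique && $5 >= 30) print
        }
    ' "$1"
}

# Toy input (made-up reads): one good read, one repeat-mapped, one low MAPQ.
toy=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$toy"
printf 'r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\n' >> "$toy"
printf 'r2\t0\tchr1\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:R\n' >> "$toy"
printf 'r3\t16\tchr1\t300\t12\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\n' >> "$toy"

filter_unique "$toy" | cut -f 1     # prints "@SQ" then "r1"
rm -f "$toy"
```

Only r1 survives: r2 lacks the XT:A:U tag and r3 falls below the MAPQ cutoff.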
iii) Remove PCR duplicates if sequence reads have identical coordinates
▸ Running a program, Picard:
/group/../java -jar /group/../picard.jar MarkDuplicates
INPUT=SEQUENCED.fastq.bam OUTPUT=SEQUENCED.fastq.bam.picard
METRICS_FILE=rmdup.out REMOVE_DUPLICATES=true …
iv) Use SEQUENCED.fastq.bam.picard (“uniquely mapped” + “non-PCR duplicates”)
in downstream analyses
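The "identical coordinates" criterion in step iii) can be illustrated with a few lines of awk. This is a deliberately simplified stand-in and not Picard's algorithm (Picard works from unclipped 5' positions and chooses which read to keep by base quality); `drop_coordinate_duplicates` is a hypothetical name and the reads are invented.

```shell
#!/bin/sh
# drop_coordinate_duplicates SAM_FILE   (hypothetical helper)
# Among alignments sharing chromosome, leftmost position, and strand,
# keep only the first one seen; header lines pass through unchanged.
drop_coordinate_duplicates() {
    awk -F '\t' '
        /^@/ { print; next }
        {
            strand = (int($2 / 16) % 2) ? "-" : "+"   # flag bit 0x10 = reverse strand
            key = $3 SUBSEP $4 SUBSEP strand
            if (!(key in seen)) { seen[key] = 1; print }
        }
    ' "$1"
}

# Toy input (made-up reads).
toy=$(mktemp)
printf 'r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\n'  >  "$toy"
printf 'r2\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$toy"   # same start, same strand
printf 'r3\t16\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$toy"   # same start, reverse strand
printf 'r4\t0\tchr1\t150\t37\t4M\t*\t0\t0\tACGT\tIIII\n'  >> "$toy"

drop_coordinate_duplicates "$toy" | cut -f 1    # prints r1, r3, r4 (one per line)
rm -f "$toy"
```

r2 is dropped because it starts at the same position on the same strand as r1, while r3 is kept because it lies on the opposite strand.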
2. Checking the quality of IP
▸ Only the first 50 bp from the 5’ end of each IPed DNA fragment is sequenced, so the alignments
form two peaks, one on the positive strand and one on the negative strand
Park (2009) Nat Rev Genet
▸ If the IP works, the densities (i.e. the numbers of sequence reads) of the two peaks are
correlated when one strand is shifted by a certain distance (i.e. the length of each fragment)
2. Checking the quality of IP
▸ Measure a strand cross-correlation (SCC) plot using an R program:
Rscript /group/../run_spp_nodups.R -c=SEQUENCED.fastq.bam.picard -savp
-out=SEQUENCED.fastq.bam.picard.spp.out
▸ Y-axis: cross-correlation (CC) between the densities of the two peaks
▸ X-axis: strand shift (i.e. distance between the peaks of positive and negative strands)
▸ There are two peaks: one is noise (the phantom peak, at a shift corresponding to the read length of 50 bp; height Pcc) and the other is the ChIP-seq peak from the IPed fragments (height ChIPcc).
▸ Two statistics are defined, where mincc is the minimum cross-correlation:
Normalized strand cross-correlation coefficient (NSC): ChIPcc / mincc
Relative strand cross-correlation coefficient (RSC): (ChIPcc - mincc) / (Pcc - mincc)
▸ According to the ENCODE project, “NSC > 1.05” and “RSC > 0.8” are thresholds for good IPed data
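As a worked example of the two ratios and thresholds above, the sketch below computes them from three cross-correlation values. The function name `qc_scores` and the input numbers are hypothetical; in practice the three values would be read off the SCC output for a real sample.

```shell
#!/bin/sh
# qc_scores CHIP_CC PHANTOM_CC MIN_CC   (hypothetical helper)
# Prints the normalized and relative strand cross-correlation ratios and
# whether both exceed the ENCODE thresholds quoted above (1.05 and 0.8).
qc_scores() {
    awk -v chipcc="$1" -v pcc="$2" -v mincc="$3" 'BEGIN {
        nsc = chipcc / mincc
        rsc = (chipcc - mincc) / (pcc - mincc)
        verdict = (nsc > 1.05 && rsc > 0.8) ? "pass" : "fail"
        printf "NSC=%.3f RSC=%.3f %s\n", nsc, rsc, verdict
    }'
}

# Made-up values: ChIP peak 0.30, phantom peak 0.25, minimum 0.20.
qc_scores 0.30 0.25 0.20    # prints: NSC=1.500 RSC=2.000 pass
```

A sample whose ChIP-seq peak barely rises above the background, e.g. `qc_scores 0.204 0.30 0.20`, fails both thresholds.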
3. Calling peaks (using MACS2)
▸ MACS2 detects peaks with a significant enrichment of sequence tags by assuming a Poisson
distribution, $P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$, where $\lambda$ is the mean, calculated from the number of sequence tags
within a given window in the input, and $k$ is the number in the IPed sample.
macs2 callpeak -t /group/../SEQUENCED.fastq.bam.picard -c
/group/../SEQUENCED_input.fastq.bam.picard -f BAM -g hs --bw XXX
--qvalue=0.05 -n out_file_macs2
▸ The output files include the information on locations of peaks:
Chr    Start      End        Length   -log10(p-value)   Fold enrichment   -log10(q-value)
chr1   13917095   13917284   190      25.95897          12.36291          20.51727
chr3   48264238   48264477   240      98.21003          31.35001          90.15012
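The Poisson enrichment idea behind peak calling can be sketched as an upper-tail probability: given an expected tag count λ from the input, how likely is it to see k or more tags in the IPed sample? The helper name `poisson_upper_tail` and the numbers are hypothetical, and this is not MACS2's implementation; in particular, subtracting the cumulative sum from 1 loses precision for very strong peaks, where a log-space calculation would be needed.

```shell
#!/bin/sh
# poisson_upper_tail K LAMBDA   (hypothetical helper)
# Prints P(X >= K) for X ~ Poisson(LAMBDA), using
# P(X = i + 1) = P(X = i) * LAMBDA / (i + 1).
poisson_upper_tail() {
    awk -v k="$1" -v lambda="$2" 'BEGIN {
        term = exp(-lambda)              # P(X = 0)
        cdf = 0
        for (i = 0; i < k; i++) {
            cdf += term                  # add P(X = i)
            term *= lambda / (i + 1)     # step to P(X = i + 1)
        }
        printf "%.6f\n", 1 - cdf
    }'
}

# Made-up window: 1 tag expected from the input, 3 tags observed in the IP.
poisson_upper_tail 3 1    # prints 0.080301
```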
4. Annotating peaks
(e.g. genomic context or the closest gene; using HOMER)
▸ Convert .bed file into .peak file using bed2pos.pl packaged in HOMER:
bed2pos.pl out_file_macs2_summits.bed > out_file_macs2_summits.peak
▸ Run findMotifsGenome.pl to find motifs in the peaks called by MACS2:
findMotifsGenome.pl out_file_macs2_summits.peak hg19 out_file_homer
-size 100 -len 8,10,12,14,16
Rank   P-value   Log(P-value)   % of Targets   Best Match/Details
1      1e-246    -5.687e+02     52.47%         MA0074.1_RXRA::VDR/Jaspar
2      1e-33     -7.605e+01     4.93%          VDR(NR),DR3/GM10855-VDR+vitDChIP-Seq(GSE22484)/Homer
3      1e-30     -6.924e+01     28.25%         MF0004.1_Nuclear_Receptor_class/Jaspar
▸ Run annotatePeaks.pl to annotate the peaks:
annotatePeaks.pl out_file_macs2_summits.peak hg19 -size -100,100
-m homer_top10.motif > out_file_macs2_summits
▸ The output file includes the information on:
Peak ID   Chr    Start       End         Strand   Peak score   …   Detailed Annotation        Distance to TSS   …   Gene Name   …
XXX       chr3   48264247    48264447    +        90.15012     …   promoter-TSS (NM_004345)   -490              …   CAMP        …
YYY       chr5   139986717   139986917   +        49.46125     …   L1MB4|LINE|L1              26218             …   CD14        …
3. Correct mapping bias
(using WASP)
▸ WASP is a program to carefully map allele-specific reads, correct for incorrect
heterozygous genotype calls, and model overdispersion of sequencing reads
van de Geijn et al. (2015) Nature Methods
▸ WASP implements an algorithm to overcome the mapping bias that favors
reads carrying the reference allele
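The core idea of the WASP re-mapping filter, as described by van de Geijn et al. (not spelled out on this slide), is that a read overlapping a SNP is kept only if it still maps to the same place after its allele is swapped to the other allele. The sketch below illustrates just that final comparison; the two-file layout (read ID, chromosome, position) and the name `keep_consistent` are assumptions for illustration and not WASP's actual file formats or scripts.

```shell
#!/bin/sh
# keep_consistent ORIGINAL.tsv REMAPPED.tsv   (hypothetical helper)
# Each file has tab-separated columns: read ID, chromosome, position.
# Print the IDs of reads whose allele-swapped copy mapped to the same
# chromosome and position as the original read.
keep_consistent() {
    awk -F '\t' '
        NR == FNR { where[$1] = $2 ":" $3; next }          # first file: original placements
        ($1 in where) && where[$1] == ($2 ":" $3) { print $1 }
    ' "$1" "$2"
}

# Toy data (made up): r2 moves after the allele swap, r3 fails to remap.
orig=$(mktemp); remap=$(mktemp)
printf 'r1\tchr1\t100\nr2\tchr1\t200\nr3\tchr2\t50\n' > "$orig"
printf 'r1\tchr1\t100\nr2\tchr5\t900\n' > "$remap"

keep_consistent "$orig" "$remap"    # prints r1
rm -f "$orig" "$remap"
```

Reads such as r2 and r3 are exactly the ones that would inflate the reference-allele count, so discarding them removes the bias at the cost of some coverage.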
4. Calling genotypes and testing allelic imbalance
(using QuASAR)
▸ Using the samtools mpileup command, create a pileup file from aligned reads:
samtools mpileup -f /group/../hg19_all_contigs.fa -l
/group/../1KG_SNPs_filt.bed /group/../input.bam | gzip >
input.pileup.gz
▸ Convert the pileup file into bed format and use intersectBed to include the allele
frequencies from a bed file:
gzip -dc input.pileup.gz | awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/ &&
$5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ &&
$5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' | sortBed -i stdin |
intersectBed -a stdin -b /group/../1KG_SNPs_filt.bed -wo | cut -f 1-7,11-14 | gzip > input.pileup.bed.gz
▸ Generate an input file for QuASAR:
R --vanilla --args input.pileup.bed.gz < /group/../convertPileupToQuasar.R
Chr    Start     End       Ref   Alt   SNP ID       Freq   #ref   #alt   #not mapped to either allele
chr1   1498376   1498377   C     T     rs11260611   0.12   4      0      0
chr1   5348913   5348914   C     T     rs12124941   0.43   2      1      0
4. Calling genotypes and testing allelic imbalance
(using QuASAR)
▸ THP1 treated with VD; FAIRE-seq data
P-value = 0.0062; rs3738668
(allele read counts A:21/C:2)
▸ Monocytes treated with VD; VDR ChIP-seq data
P-value = 0.2492109; rs11784276
(allele read counts T:5/C:2)
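To build intuition for what "allelic imbalance" means for counts like those above, the sketch below runs a plain two-sided exact binomial test against a 50:50 expectation. This is a simplified stand-in and not QuASAR's model, which additionally accounts for genotype uncertainty and overdispersion, so the p-values it prints are not expected to match the ones reported on this slide. `binom_two_sided` is a hypothetical name.

```shell
#!/bin/sh
# binom_two_sided COUNT_A COUNT_B   (hypothetical helper)
# Two-sided exact binomial p-value for the two allele counts under
# Binomial(n, 0.5); the distribution is symmetric, so the smaller tail is doubled.
binom_two_sided() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        a += 0; b += 0
        n = a + b
        lo = (a < b) ? a : b                 # smaller of the two allele counts
        term = exp(-n * log(2))              # P(X = 0) = 0.5^n
        tail = 0
        for (k = 0; k <= lo; k++) {
            tail += term                     # add P(X = k)
            term *= (n - k) / (k + 1)        # C(n,k+1) = C(n,k) * (n-k)/(k+1)
        }
        p = 2 * tail
        if (p > 1) p = 1                     # cap when the two tails overlap
        printf "%.4g\n", p
    }'
}

binom_two_sided 5 2     # prints 0.4531
binom_two_sided 21 2    # prints 6.604e-05
```

The ordering of the two examples agrees with the slide (21:2 is far more skewed than 5:2), but the magnitudes differ because the simple binomial ignores the extra sources of variance that QuASAR models.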