November 16th, 2015
Di Rienzo lab meeting
Introduction of the ChIP-seq pipeline
Shigeki Nakagome

ChIP-seq
Step 1: Chromatin immunoprecipitation (IP)
▸ A target protein (e.g. VDR) binds to DNA in an open chromatin region
▸ Sonicate open chromatin regions
▸ Capture VDR bound to DNA fragments with a VDR antibody
▸ IPed DNA fragments are enriched for genomic regions bound by VDR
Step 2: Next-generation sequencing
▸ Make a library from the IPed DNA and sequence it with Illumina (single-end reads; 50 bp)
Park (2009) Nat Rev Genet

A workflow of processing sequenced data
1. Mapping sequence reads
2. Checking the quality of IP
▸ Identifying TF binding sites
    3. Calling peaks (using MACS2)
    4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)
▸ Identifying SNPs associated with allelic imbalance
    3. Correct mapping bias (using WASP)
    4. Calling genotypes and testing allelic imbalance (using QuASAR)
Leung et al. (2015) Nature

1. Mapping sequence reads
i) Map sequence reads to the human reference genome using BWA
▸ Aligning sequence reads (-n 2: at most two differences; -o 0: no gap opens):
    bwa aln -n 2 -o 0 REFERENCE.fa SEQUENCED.fastq > SEQUENCED.fastq.sai
▸ Generating a SAM format file:
    bwa samse REFERENCE.fa SEQUENCED.fastq.sai SEQUENCED.fastq > SEQUENCED.fastq.sam
ii) Choose uniquely mapped reads based on the BWA “XT” tag and the mapping quality
▸ Extracting uniquely mapped sequence reads based on the tag “XT:A:U” (header lines, which start with “@”, are kept so that samtools can still read the file):
    grep -E "^@|XT:A:U" SEQUENCED.fastq.sam > SEQUENCED.fastq.sam.tmp
▸ Keeping mapped reads (-F 4) with a mapping quality of at least 30 (-q 30):
    samtools view -bhS -q 30 -F 4 -o SEQUENCED.fastq.bam SEQUENCED.fastq.sam.tmp
iii) Remove PCR duplicates, i.e. sequence reads with identical coordinates
▸ Running Picard:
    /group/../java -jar /group/../picard.jar MarkDuplicates INPUT=SEQUENCED.fastq.bam OUTPUT=SEQUENCED.fastq.bam.picard METRICS_FILE=rmdup.out REMOVE_DUPLICATES=true …
iv) Use SEQUENCED.fastq.bam.picard (“uniquely mapped” + “non-PCR duplicates”) in downstream analyses

2. Checking the quality of IP
▸ Only 50 bp of each IPed DNA fragment is sequenced from the 5’ end, so the alignment produces two peaks, one on the positive strand and one on the negative strand
▸ If the IP works, the densities (i.e. the numbers of sequence reads) of the two peaks are correlated and separated by a certain distance (i.e. the fragment length)
Park (2009) Nat Rev Genet

2. Checking the quality of IP
▸ Measure a Strand Cross-Correlation (SCC) plot using an R program:
    Rscript /group/../run_spp_nodups.R -c=SEQUENCED.fastq.bam.picard -savp -out=SEQUENCED.fastq.bam.picard.spp.out
▸ Y-axis: cross-correlation (CC) between the densities of the two peaks
▸ X-axis: strand shift (i.e. distance between the peaks on the positive and negative strands)
▸ There are two peaks: one is noise (the phantom peak, Pcc, at a shift corresponding to the read length of 50 bp) and the other is the ChIP-seq peak (ChIPcc).
▸ Two statistics are defined, where mincc is the minimum cross-correlation over all strand shifts:
    Normalized strand cross-correlation coefficient (NSC): ChIPcc / mincc
    Relative strand cross-correlation coefficient (RSC): (ChIPcc - mincc) / (Pcc - mincc)
▸ According to the ENCODE project, “NSC > 1.05” and “RSC > 0.8” are thresholds for good IPed data

3. Calling peaks (using MACS2)
▸ MACS2 detects peaks with a significant enrichment of sequence tags by assuming a Poisson distribution, $P(X = k) = \lambda^{k} e^{-\lambda} / k!$, where $\lambda$ is the mean, calculated from the number of sequence tags within a given window in the input, and $k$ is the number of sequence tags in the IPed sample:
    macs2 callpeak -t /group/../SEQUENCED.fastq.bam.picard -c /group/../SEQUENCED_input.fastq.bam.picard -f BAM -g hs --bw XXX --qvalue=0.05 -n out_file_macs2
▸ The output files include the locations of the peaks:

    Chr   Start     End       Length  -log10(p-value)  Fold enrichment  -log10(q-value)
    chr1  13917095  13917284  190     25.95897         12.36291         20.51727
    chr3  48264238  48264477  240     98.21003         31.35001         90.15012

4. Annotating peaks (e.g. genomic context or the closest gene; using HOMER)
▸ Convert the .bed file into a .peak file using bed2pos.pl, which is packaged in HOMER:
    bed2pos.pl out_file_macs2_summits.bed > out_file_macs2_summits.peak
▸ Run findMotifsGenome.pl to find motifs in the peaks called by MACS2:
    findMotifsGenome.pl out_file_macs2_summits.peak hg19 out_file_homer -size 100 -len 8,10,12,14,16

    Rank  P-value  Log(P-value)  % of Targets  Best Match/Details
    1     1e-246   -5.687e+02    52.47%        MA0074.1_RXRA::VDR/Jaspar
    2     1e-33    -7.605e+01    4.93%         VDR(NR),DR3/GM10855-VDR+vitDChIP-Seq(GSE22484)/Homer
    3     1e-30    -6.924e+01    28.25%        MF0004.1_Nuclear_Receptor_class/Jaspar

▸ Run annotatePeaks.pl to annotate the peaks:
    annotatePeaks.pl out_file_macs2_summits.peak hg19 -size -100,100 -m homer_top10.motif > out_file_macs2_summits
▸ The output file includes, for each peak:

    Peak ID  Chr   Start      End        Strand  Peak score  …  Detailed Annotation       Distance to TSS  …  Gene Name  …
    XXX      chr3  48264247   48264447   +       90.15012    …  promoter-TSS (NM_004345)  -490             …  CAMP       …
    YYY      chr5  139986717  139986917  +       49.46125    …  L1MB4|LINE|L1             26218            …  CD14       …
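
To make the Poisson model from the peak-calling step (step 3) concrete, here is a minimal sketch of how an upper-tail p-value could be computed for a single window. It is not MACS2's code: MACS2 estimates λ locally and corrects for multiple testing, and neither is reproduced here. The function names and the tag counts (4 expected, 20 observed) are hypothetical and chosen only for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k): probability of at least k tags when lam tags are expected."""
    if k <= 0:
        return 1.0
    term = poisson_pmf(k, lam)  # first term of the tail sum
    total = 0.0
    i = k
    # Sum the tail directly (rather than 1 - CDF) so small p-values stay accurate.
    # Keep going while terms still grow (i <= lam) or still contribute to the sum.
    while term > 0.0 and (i <= lam or term > total * 1e-17):
        total += term
        i += 1
        term *= lam / i  # pmf(i) = pmf(i - 1) * lam / i
    return min(total, 1.0)


# Hypothetical window: 4 tags expected from the input, 20 observed in the IP.
expected_tags = 4.0
observed_tags = 20

p_value = poisson_upper_tail(observed_tags, expected_tags)
fold_enrichment = observed_tags / expected_tags

print(f"fold enrichment = {fold_enrichment:.2f}")
print(f"p-value         = {p_value:.3g}")
print(f"-log10(p-value) = {-math.log10(p_value):.2f}")
```

The last two printed values correspond to the kind of numbers reported in the "-log10(p-value)" and "Fold enrichment" columns of the MACS2 output, although the real values depend on MACS2's own estimate of λ.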

3. Correct mapping bias (using WASP)
▸ WASP is a program to carefully map allele-specific reads, correct for incorrect heterozygous genotype calls, and model overdispersion of sequencing reads
▸ [Figure] The re-mapping algorithm implemented in WASP, which overcomes the mapping bias in favor of reads carrying the reference allele
van de Geijn et al. (2015) Nature Methods

4. Calling genotypes and testing allelic imbalance (using QuASAR)
▸ Using the samtools mpileup command, create a pileup file from aligned reads:
    samtools mpileup -f /group/../hg19_all_contigs.fa -l /group/../1KG_SNPs_filt.bed /group/../input.bam | gzip > input.pileup.gz
▸ Convert the pileup file into bed format and use intersectBed to add the allele frequencies from a bed file:
    less input.pileup.gz | awk -v OFS='\t' '{ if ($4>0 && $5 !~ /[^\^][<>]/ && $5 !~ /\+[0-9]+[ACGTNacgtn]+/ && $5 !~ /-[0-9]+[ACGTNacgtn]+/ && $5 !~ /[^\^]\*/) print $1,$2-1,$2,$3,$4,$5,$6}' | sortBed -i stdin | intersectBed -a stdin -b /group/../1KG_SNPs_filt.bed -wo | cut -f 1-7,11-14 | gzip > input.pileup.bed.gz
▸ Generate an input file for QuASAR:
    R --vanilla --args input.pileup.bed.gz < /group/../convertPileupToQuasar.R

    Chr   Start    End      Ref  Alt  SNP ID      Freq  #ref  #alt  #not mapped to either allele
    chr1  1498376  1498377  C    T    rs11260611  0.12  4     0     0
    chr1  5348913  5348914  C    T    rs12124941  0.43  2     1     0

4. Calling genotypes and testing allelic imbalance (using QuASAR)
▸ THP1 treated with VD; FAIRE-seq data
    P-value = 0.0062; rs3738668 (A:21/C:2)
▸ Monocytes treated with VD; VDR ChIP-seq data
    P-value = 0.2492109; rs11784276 (T:5/C:2)
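
As a rough illustration of what "testing allelic imbalance" means, the sketch below applies an exact two-sided binomial test (null hypothesis: both alleles are equally likely to be read) to the two allele-count pairs shown above. This is a deliberately simpler stand-in, not QuASAR's method: QuASAR also accounts for genotype uncertainty and overdispersion, so the values here are not expected to match the P-values on the slide. The function name is hypothetical.

```python
from math import comb


def binomial_two_sided(count_a: int, count_b: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test for allele counts.

    Sums the probabilities of all outcomes that are no more likely than the
    observed split, under the null that allele A is read with probability p.
    """
    n = count_a + count_b
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[count_a]
    # Small tolerance so outcomes tied with the observed one are included.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Allele counts taken from the two examples above.
examples = {
    "rs3738668 (A:21/C:2)": (21, 2),
    "rs11784276 (T:5/C:2)": (5, 2),
}

for label, (a, b) in examples.items():
    print(f"{label}: binomial p = {binomial_two_sided(a, b):.2g}")
# rs3738668 (A:21/C:2): binomial p = 6.6e-05
# rs11784276 (T:5/C:2): binomial p = 0.45
```

The plain binomial test gives a much smaller p-value for rs3738668 than the 0.0062 reported above. That is the expected direction when a model allows for overdispersion, but the comparison is only qualitative.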