Max-Planck-Institut für molekulare Genetik Comparing Methods for Identifying Transcription Factor Target Genes Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics Software Praktikum, 1.2.2013 Folie 1 Max-Planck-Institut für molekulare Genetik Transcriptional Regulation TF not bound = no gene expression TF bound = gene expression Software Praktikum, 1.2.2013 Max-Planck-Institut für molekulare Genetik Transcriptional Regulation TF not bound = no gene expression TF bound = gene expression Problem: There are many genes and many TF's, how do we identify the targets of a TF? Software Praktikum, 1.2.2013 Max-Planck-Institut für molekulare Genetik Methods for Identifying TF Target Genes PWM Genome Scan Microarray ChIP-seq Software Praktikum, 1.2.2013 Max-Planck-Institut für molekulare Genetik PWM Genome Scan •Purely computational method •Input: o o position weight matrix for your TF genomic region(s) of interest Score threshold •Pros: o No need to do wet lab experiments •Cons: o Many false positives, not able to take biological conditions into account Software Praktikum, 1.2.2013 Max-Planck-Institut für molekulare Genetik PWM genome scan 1) Download the PWMs of your TF of interest from the database (they might include >1 motif) 1) Define the sequences to analyze (promoter sequences) 1) Run the PWM genome scan (hitbased method or affinity prediction method) 1) Rank the genomic sequences by the affinity signal Software Praktikum, 1.2.2013 Suggested Reading: • Roider et al.: Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007). • Thomas-Chollier et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols (2011). Folie 6 Max-Planck-Institut für molekulare Genetik PWM-PSCM Stat3 pscm Binding motif for the transcription factor: Stat3 from ChIP-seq experiment in mouse (Jaspar ID: MA0144.1) Software Praktikum, 1.2.2013 Folie 7 Max-Planck-Institut für molekulare Genetik TRAP 1) Convert the PSSM(position 2) 3) 4) 5) specific scoring matrix) to PSEM (position specific energy matrix) Scan the sequences of interest with TRAP Results in 1 score per sequence=binding affinity Doesn’t separate the exact TF binding sites (easier for ranking) Sequences must have the same length! ANNOTATE=/project/gbrowse/Pipeline/ANNOTATE_v3.02/Release TRAP trap.molgen.mpg.de/cgi-bin/home.cgi Software Praktikum, 1.2.2013 Folie 8 Max-Planck-Institut für molekulare Genetik Matrix-scan 1) Use directly the PSSM 2) Finds all TFBS which exceed a predefined threshold (e.g. p-value) 3) More complicated to create ranked lists of genomic sequences (more hits in the sequence) 4) Exact location of the binding site reported matrix-scan http://rsat.ulb.ac.be/ Software Praktikum, 1.2.2013 Folie 9 Max-Planck-Institut für molekulare Genetik Finding the target genes • target genes will be the top-ranked genes (promoters) • which are the top-ranked genes? (top-100,500,1000...?) • There’s no exact definition of promoters, usually 2000bp upstream, 500bp downstream of the TSS Software Praktikum, 1.2.2013 Folie 10 Max-Planck-Institut für molekulare Genetik Microarrays → R/Bioconductor (details later) Software Praktikum, 1.2.2013 Max-Planck-Institut Folie 12 Genetik für molekulare Microarrays (2) •Pros: o o o There is a lot of microarray data already available (might not have to generate the data yourself) Inexpensive and not very difficult to perform Computational workflow is well established •Cons: o Can not distinguish between indirect regulation and direct regulation Software Praktikum, 1.2.2013 Max-Planck-Institut für molekulare Genetik ChIP-seq Map reads to the genome Call peaks to determine most likely TF binding locations Software Praktikum, 1.2.2013 Max-Planck-Institut Folie 14 Genetik für molekulare ChIP-seq (2) •Pros: o Direct measure of genome-wide protein-DNA interaction(*) •Cons: o o o o o Don't know whether binding causes changes in gene expression More complicated experimentally and in terms of computational analysis Most expensive Need an antibody against your protein of interest Biases are not as well understood as with microarrays Software Praktikum, 1.2.2013 Max-Planck-Institut für molekulare Genetik ChIP-seq analysis 1) Download the reads from given source (experiments and controls) 2) Quality control of the reads and statistics (fastqc) 3) Mapping the reads to the reference genome (bwa/Bowtie) 4) Peak calling (MACS) 5) Visualization of the peaks in a genome browser (genome browser, IGV) 6) Finding the closest genes to the peaks(Bioconductor/ChIPp eakAnno) Software Praktikum, 1.2.2013 Visualised peaks in a genome browser Suggested Reading: • Bailey et alPractical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol (2013). • Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols (2012). Folie 15 Max-Planck-Institut für molekulare Genetik Sequencing data • • raw data=reads usually very large file (few GB) format fastq (ENCODE) or SRA (Sequence Read Archive of NCBI) Analysis 1) Quality control with fastqc 2) Filtering of reads with adapter sequences 3) Mapping of the reads to the reference genome (bwa or Bowtie) Example of fastq data file Software Praktikum, 1.2.2013 Folie 16 Max-Planck-Institut für molekulare Genetik Quality control with fastqc • • • • • • • • per base quality sequence quality (avg. > 20) sequence length sequence duplication level (duplication by PCR) overrepresented sequences/kmers (adapter sequences) produces a html report manual (read it!) Example of per base seq quality scores software at the MPI FASTQC=/scratch/ngsvin/bin/chip-seq/fastqc/FastQC/fastqc Software Praktikum, 1.2.2013 Folie 17 Max-Planck-Institut für molekulare Genetik Mapping with bwa • • • 1) 2) mapping the sequencing reads to a reference genome manual (read it!) map the experiments and the controls reference genome in fasta format (hg19) create an index of the reference file for faster mapping (only if not available) 3) align the reads (specify parameters e.g. for # of mismatches, read trimming, threads used...) 4) generate alignments in the SAM format (different commands for single-end and pair-end reads!) software and data at the MPI: BWA = /scratch/ngsvin/bin/executables/bwa hg19: /scratch/ngsvin/MappingIndices/hg19.fa bwa index: /scratch/ngsvin/MappingIndices/BWA/hg19 Software Praktikum, 1.2.2013 Folie 18 Max-Planck-Institut für molekulare Genetik File manipulation with samtools • • 1) 2) 3) utilities that manipulate SAM/BAM files manual (read it!) merge the replicates in one file (still separate experiment and control) convert the SAM file into BAM file (binary version of SAM, smaller) sort and index the BAM file now the sequencing files are ready for further analysis software at the MPI: SAMTOOLS = /scratch/ngsvin/bin/executables/samtools Software Praktikum, 1.2.2013 Folie 19 Max-Planck-Institut für molekulare Genetik Peak finding with MACS • find the peaks, i.e. the regions with a high density of reads, where the studied TF was bound • manual (read it!) 1) call the peaks using the experiment (treatment) data vs. control 2) set the parameters e.g. fragment length, treatment of duplication reads 3) analyse the MACS results (BED file with peaks/summits) software at the MPI: MACS = /scratch/ngsvin/bin/executables/macs Software Praktikum, 1.2.2013 Folie 20 Max-Planck-Institut für molekulare Genetik Finding the target genes • • • find the genes which are in the closest distance to the (significant) peaks how to define the closest distance? (+- X kb) use ChIPpeakAnno in Bioconductor or bedtools Scale chr10: 69,200,000 78 _ GM12878 c-Myc Sg 100 kb hg18 69,250,000 69,300,000 69,350,000 UCSC Genes (RefSeq, GenBank, tRNAs & Comparative Genomics) DNAJC12 SIRT1 DNAJC12 SIRT1 SIRT1 HERC4 HERC4 HERC4 HERC4 HERC4 KIAA1593 ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Peaks (c-Myc in GM12878 cells) HERC4 ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Signal (c-Myc in GM12878 cells) 0_ ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Peaks (c-Myc in K562 cells) 78 _ ENCODE TFBS, Yale/UCD/Harvard ChIP-seq Signal (c-Myc in K562 cells) K562 c-Myc Sig 0_ RepeatMasker Software Praktikum, 1.2.2013 Repeating Elements by RepeatMasker Folie 21 Max-Planck-Institut für molekulare Genetik Methods for Identifying TF Target Genes PWM Genome Scan Microarray ChIP-seq Software Praktikum, 1.2.2013 Threshold s Max-Planck-Institut für molekulare Genetik Bioinformatics • • • • Read mapping (Bowtie/bwa) Peak Calling (MACS/Bioconduct • or) • Peak-Target Analysis (Bioconductor) Microarray data analysis (Bioconductor) Differential Genes (R) GSEA • • Software Praktikum, 1.2.2013 • • PWM Genome Scan (TRAP/MatScan) Statistics (R) Data Integration (R/Python/Perl) Statistical Analysis (R) Folie 23 Max-Planck-Institut für molekulare Genetik Bioinformatics tools READ THE MANUALS! • • • • • • Bowtie bowtie-bio.sourceforge.net/manual.shtml bwa bio-bwa.sourceforge.net/bwa.shtml MACS github.com/taoliu/MACS/blob/macs_v1/README.rst TRAP trap.molgen.mpg.de/cgi-bin/home.cgi matrix-scan http://rsat.ulb.ac.be/ Bioconductor www.bioconductor.org/ (more info in R course) Databases • • • • GEO www.ncbi.nlm.nih.gov/geo/ ENCODE genome.ucsc.edu/ENCODE/ SRA www.ncbi.nlm.nih.gov/sra JASPAR http://jaspar.genereg.net/ Software Praktikum, 1.2.2013 Folie 24 Max-Planck-Institut für molekulare Genetik Schedule • • • • • • • 03.03. Introduction lecture, R course 04.03. R & Bioconductor homework submission 11.03. Presentation of the detailed plan of each group (which TF, cell line, tools, data, data integration, team work ) 10:30am, 11:30am every Tuesday 10:30am, 11:30am progress meetings 17.04. Final report deadline 24.04. (tentative) Presentations 28.04. Final meeting, discussion of final reports Software Praktikum, 1.2.2013 Folie 25 Max-Planck-Institut für molekulare Genetik GR Group • Expression and ChIP-seq data: Luca F, Maranville JC, et al., PLoS ONE, 2013 • PWM database: jaspar.genereg.net Software Praktikum, 1.2.2013 Folie 26 Max-Planck-Institut für molekulare Genetik c-Myc Group • Expression data: Cappellen, Schlange, Bauer et al., EMBO reports, 2007 • Musgrove et al., PLoS One, 2008 • ChIP-seq data: ENCODE Project • PWM database: jaspar.genereg.net Software Praktikum, 1.2.2013 Folie 27 Max-Planck-Institut für molekulare Genetik Additional analysis Binding motifs binding motifs • are the overrepresented motifs in the ChIP-peak regions different? • do we find any co-factors? Recommended tool: RSAT rsat.ulb.ac.be binding motifs Software Praktikum, 1.2.2013 binding motifs Folie 28