Microsoft Word Document - Computational Cancer Genomics - Vital-IT

PRINCIPLES OF CHIP-SEQ DATA ANALYSIS ILLUSTRATED WITH EXAMPLES

In this tutorial we are going to illustrate the capabilities of the ChIP-Seq server using data from an early landmark paper on STAT1 binding sites in γ-interferon stimulated HeLa cells

( Robertson, et al., 2007 ). This data set, which comprises about 15 million mapped sequence tags, and is available from the ChIP-Seq server menu.

What follows is a step-by-step description of how the results have been produced.

The list of data files that have been used for the analysis can be found at: http://ccg.vital-it.ch/chipseq/doc/resources

Note that some of the analysis steps described in this tutorial rely on programs from the

Signal Search Analysis (SSA) server at: http://ccg.vital-it.ch/ssa/

1. 5’-3’ end correlation (ChIP-Cor)

We start by generating a 5’-3’ correlation plot using

ChIP-Cor

. We use the 5’ (+ strand) tags as reference feature and compute the frequencies of 3’ tags as a function of the distance from the reference feature.

To this end, open the ChIP-Cor server home page at: http://ccg.vital-it.ch/chipseq/chip_cor.php

Fill out the form as follows:

Input Data Reference Feature

 Select available Data Sets

Genome : H.sapiens (March 2006/hg18)

Data type : ChIP-seq

Series : Robertson 2007, HeLa S3 cells, Genome-wide STAT1 profiles

Sample : Hela S3 STAT1 INFg

Additional Input Data Options

Strand : +

Analysis Parameters

Range : -1000 to 1000

Histogram Parameters

Window width : 10

Count Cut-off : 1

Normalization : count density

Input Data Target Feature







Strand : -

On the output page you will see the following picture:

ChIP-Cor offers several options for scaling the abundance of the target feature. Here, we have chosen “count density”, which is defined as the number of target feature tags per base pair. We note a Gaussian peak with a maximum at about position +150, suggesting that the average length of an immunoprecipitated fragment is about 150 bp. In all subsequent analyses, we will therefore use half of this value (75 bp) as centering distance for jointly analyzing 5’ and 3’ tags. Centering means shifting the positions of tags mapping to the + or − strand of the chromosome by a fixed distance downstream and or upstream, respectively.

Centering increases the resolution of the ChIP-Seq data.

On the left side, there is a weak shoulder at about position +35 which results from a common artifact seen in almost all ChIP-

Seq experiments. We can generate the same plot for the control data set (‘Hela S3 STAT1 unstim’ sample) and compare the two distributions. The ChIP-Seq tag distribution for the control data set is similar to the background distribution.

Next, we generate a so-called autocorrelation plot for centered STAT1 tags against themselves (same reference and target feature).

The step-by-step procedure is the following:







Strand : any

Centering : 75

Repeat Masker : unchecked


Range : -1000 to 1000


Window width : 10

Count Cut-off : 1

Normalization : count density







Strand : any

Centering : 75


The results are shown below:

We see again a Gaussian peak this time with a maximum at 0. The ChIP-Cor server automatically attempts to fit the correlation histogram to a Gaussian curve. If successful, the results of the fit can be accessed via hyperlinks on the output page. Results are provided in graphical and textual form. The link ‘Single Gaussian Fit’ takes you to a figure showing an optimal Gaussian fit and the corresponding parameters (the parameter μ is the peak center position, here μ = 0). The text output file under the ‘Parameters’ link contains recommended parameters for the subsequent peak finding step.

2. Peak detection

The Gaussian fit to the auto-correlation plot suggests to use a window of 286 bp and a peak threshold value of 12 tags. We round the window to 300.

Below are the results obtained from the “Single Gaussian Fit” in the previous analysis:

Single Gaussian Fit Parameters

Formula: y ~ a + (b/(sig*sqrt(2*pi))) * exp(-(x - mean)^2/(2 * sig^2))

Estim. SE t Pr(>|t|) a 0.00934 7.36e-05 127 2.43e-189 b 7.86 0.0906 86.7 1.07e-157 sig 143 1.65 87 5.63e-158 mean 0.507 1.5 0.338 0.736

Residual = 0.01487

Peak Finding Parameters Estimate

Av. BG density (cnts)

Peak window width (bp)

Peak threshold (cnts)

Peak Threshold

0.00934

286

12

Expected Random Peaks

8

9

10

11

12

19437

5045

1196

261

53

We will use these parameters with ChIP-Peak . Go to the input form at:

• http://ccg.vital-it.ch/chipseq/chip_peak.php

and fill it out as follows:

ChIP-Seq Input Data







Strand : any

Centering : 75


Peak Detection Parameters

Window Width (bp) : 300

Vicinity Range (bp) : 300

Peak Threshold : 12

Count Cut-off : 1

Refine Peak Position : checked

Genome Viewing Parameters :

Wig Track name : unchecked (blank)

Chromosome Region : unchecked

Running then ChIP-Peak with the recommended tag threshold returns

55’937

peaks. We have to be aware that 12 is a minimal threshold maximizing sensitivity. For many types of downstream analysis more stringently selected peak lists are preferable. We therefore repeat

ChIP-Peak with higher thresholds of 25, 50 and 100 tags and obtain 16’337, 4’445 and 1’522 peaks, respectively.

ChIP-Peak returns peak lists in three formats, SGA, FPS, and BED. It is recommended to save them in all three formats for further analysis. Moreover, the output page contains an action button allowing for remapping of the chromosomal coordinates to other genome assemblies. We use this button to remap all peaks from hg18 to hg19, in order to be able to jointly analyze our peaks with server-resident data provided only for hg19.

Some of the identified STAT1 peaks fall into repetitive elements of the human genome. These peaks may cause problems for certain types of downstream analysis, in particular DNA motif discovery. Since all ChIP-Seq server input forms allow users to filter out tags falling into repeat regions, we rerun ChIP-Peak one more time with the RepeatMasker checkbox activated. The number of peaks found with different parameter combinations, are given in the

Table below:

STAT1 sites list

Peaks, threshold 12

Peaks, threshold 25

Peaks, threshold 30

Peaks, threshold 50

Peaks, threshold 100 whole genome hg18 hg19

55937 55922

16337 16332

11478

4445

1522

11474

4442

1521 repeat masked hg18 hg19

43255 48926

12354 14155

8583

3250

1076

9902

3778

1276

3. Peak analysis with external tools

The output page of ChIP-Peak contains links to other web resources. The link to the UCSC browser enables the user to view individual STAT1 peaks in the context of other genomic features. The output page also contains a hyperlink to the GREAT server (McLean, et al.,

2010), which performs GO enrichment term analysis of the genes in the neighborhood of the peaks. A mouse-click on this link returns the GOterm list shown here below:

We note that the majority terms relate to in cytokine mediated signaling consistent with the reported biological function of STAT1.

Another topic of interest is the location of TF binding peaks relative to protein coding genes.

One web-based resource performing such an analysis is Nebula (Boeva, et al., 2012). It returns graphics showing the abundance of peaks within promoter regions, gene bodies, intergenic regions, and components of genes (intron, exons, etc.). .

Go to Nebula .

• Import the BED-formatted peak file generated with ChIP-Peak by selecting

Upload File under Get Data (on the left side of the window) and select bed as file format and Human Mar. 2006 (NCBI36/hg18) as genome version.

The uploaded file will appear on the right side of the page.

• On the left side, select NGS: Peak Annotation under NGS TOOLBOX . A number of tools will appear. Choose Genomic annotation of Chip-Seq peaks. A new menu will appear. Use default parameters and click on Execute . The results files (text and image) will appear on the right side.

The Figure below shows the results for the STAT1 peaks list obtained running ChIP-Peak with threshold 50 and Repeat Masker on:

Nebula also returns a peak annotation table, indicating for each peak the nearest gene and its relative location to that gene.

4. Motif studies in peak regions

STAT1 is known to bind to a DNA motif approximately described by the consensus sequence

TTCNNNGAA. If the peaks found by ChIP-Peak were real binding sites, one would expect this motif to be over-represented near the peak center positions. In fact, motif enrichment analysis is commonly used for benchmarking the performance ChIP-Seq peak finders

( Wilbanks and Facciotti, 2010 ). The OProf program of the SSA server can be used for this purpose. It returns a graph showing the percentage of sequences containing a motif in a sliding window around a reference position.

Step-by-step procedure

Go to:

• http://ccg.vital-it.ch/ssa/oprof.php

Then fill out the input form as follows:

SSA Input Data Signal Description

 Upload FPS file

stat1_t12_hg19.fps

FPS name : ChIP-Peak

FPS type: STAT1 peaks

Sequence Range

Entire sequence range : unchecked

5'border : -499 3'border : 500

Sliding window parameters

Window size : 100 Window shift: 5

Search mode : bidirectional

 Consensus seq

TTCNNNGAA

Mismatches : 0

Name : TTCNNNGAA

Reference Position : 5

The fields "FS name", "FPS type" and "Name" (on the right side) only define text elements displayed in the output figure. They do not influence the analysis in any other ways. We use the search mode "bidirectional" because ChIP-Seq peaks are un-oriented. In the output, you can see a clear peak centered at position 0. You can repeat the same type of analysis for different peak thresholds.

The Figure below shows the motif occurrence profiles for TTCNNNGAA for the four different STAT1 peak lists obtained with different tag thresholds (12, 25, 50, 100). With all peak lists, we see a clear enrichment of STAT1 motifs near position zero (the reported peak center). As expected, the peak height is inversely correlated to the number of the peaks.

The OProf server provides access to a large number of functional sites collections, including two STAT1 peak lists by the ENCODE consortium.


Go to:


Then fill out the input form as follows:



Genome : H.sapiens (Feb2009/hg19)

Data type : ENCODE ChIP-seq-peak

Series : Wang et al. 2012,

Transcription Factor Binding Sites from ENCODE Stanford/Yale/USC/Harvard

Sample : Hela-S3 STAT1 std - IFNg30

- peaks

Sequence Range






 Consensus seq

TTCNNNGAA

Mismatches : 0

Name : TTCNNNGAA

Reference Position : 5

Carry out the same analysis on sample ‘K562 - STAT1 std - IFNg30 – peaks’.

The Figure shows consensus sequence enrichment profiles for these peaks lists and the ones generated by ChIP-Peak:

We note that our lists generated from earlier data compares favorably to the ENCODE peak lists, both in terms of enrichment (peak volume) and positional resolution (peak width).

For most TFs, consensus sequences can only provide approximations of the true binding motifs. Position weight matrices (PWMs) are widely used to describe the binding specificity of TFs. The OProf server provides menu-driven access to PWMs from several public resources, including a STAT1 matrix from the JASPAR database.

To search the JASPAR MA0137.2 STAT1 motif, fill out the OProf input form as follows:



stat1_t50_hg19.fps



Sequence Range






 PWMs from Library

Motif Library : JASPAR CORE 2009 vertebrates

Motif : MA0137.2 STAT1(length=15)

Cut-off : p-value

Value: 0.0001

Name : MA0137.2 STAT1

Reference position : 8

Save the results in TEXT format for later use (making composite figures).

5. De novo motif discovery with MEME-ChIP

An even better matrix may be obtained by applying a de novo motif discovery program to peak regions. We used the popular MEME-ChIP server ( Machanick and Bailey, 2011 ) for this purpose.


The sequence file provided as input to the MEME-ChIP server was downloaded with the sequence extraction tool on ChIP-Peak output page.

Go to the Chip-Peak page at:

• http://ccg.vital-it.ch/chipseq/chip_peak.php

and fill it out as follows:

ChIP-Seq Input Data







Strand : any

Centering : 75

Repeat Masker : checked

Peak Detection Parameters

Window Width (bp) : 300

Vicinity Range (bp) : 300

Peak Threshold : 100

Count Cut-off : 1

Refine Peak Position : checked

Genome Viewing Parameters :

Wig Track name : unchecked (blank)

Chromosome Region : unchecked

We used the repeat-masked peak list obtained with tag threshold 100. Extract sequences from position −60 to +60 relative to the peaks center, using the menu that appears at the bottom of the ChIP-Peak result page:

Sequence Extraction Option

From : -60 To: + 60

This returns a list of sequences around the peaks in FASTA format. A new link ‘Sequence

File’ will appear. Open this link.

Now, open the home page of the MEME Suite in another browser window:

• http://meme.sdsc.edu and click on the MEME-Chip link near the lower right corner of the page. Enter your email address, then copy&paste the sequences from the open ChIP-Peak window into the text area of the MEME-ChIP page that serves for sequence input.

Since we expect the STAT1 binding motif to be palindromic, we restrict the MEME-ChIP search to palindromic motifs.

Keep all the default parameters except the maximum motif width (=15), and check the box near ‘look for palindromes only’.

On the MEME-ChIP result page, click on the link ‘MEME-ChIP html output’ to get the result page. Here is the main motif found by MEME:

E-value 1.2e-803

Width 11

Sites 553

You can expand all clusters to show all motifs. Besides the graphical display of motif 1, you can click on the link “MEME” to view the MEME output. At the “Discovered Motifs” section, beside the logo of motif 1, you click on the arrow “Submit/Download” and a pop-up window will appear allowing you to download the motif. There, click on "Download Motif" and select the “Minimal MEME” format. Both probability and log-odds matrices will be displayed. The ‘log-odds’ matrix scoring is the following:

-89 5 -96 92

-656 -438 -462 193

-468 -607 -303 190

-100 180 -549 -414

-369 163 -508 -22

10 -11 -11 10

-22 -508 163 -369

-414 -549 180 -100

190 -303 -608 -468

193 -462 -438 -656

92 -96 5 -89

Keep this window open and access the OProf input form in another window:


Fill out the form as follows :



stat1_t50_hg19.fps



Sequence Range




Window size : 100

Window shift : 5


 Custom Weight Matrix

Format: Integer PWM

PWM text : cut&paste matrix from

MEME output into text area

Cut-off : p-value

Value: 0.0001

Name : MEME STAT1 motif 1


The p-value thresholds for all motifs have been chosen so as to get an equal background distribution for all profiles.

The Figure below shows motif enrichment profiles for the STAT1 consensus sequence, the

JASPAR matrix, and the de novo generated matrix. We note that PWMs show higher enrichment than the consensus sequence. However, there isn’t an obvious difference between the matrix generated by MEME and the JASPAR matrix.

Sequence logos for the two PWMs are shown here below:

6. Exploring the genomic context of STAT1 peaks

ChIP-Cor enables the user to generate aggregation plots for a great variety of target features from peak lists. We first investigate whether the STAT1 binding sites are associated with active or repressive histone marks. Since the STAT1 binding experiment was carried out in

HeLa cells, we choose histone modification data from the same cell type generated by the

ENCODE consortium. Specifically, we would like to test an active promoter mark

(H3K4me3), an active enhancer mark (H3K27ac) and a repressive chromatin mark

(H3K27me3). Remember in this context, that STAT1 peaks were discovered in HeLa cells that were stimulated with gamma-interferon. On the other hand, the histone modification maps from ENCODE were obtained from non-stimulated cells, in which STAT1 is not supposed to be bound to genomic target sites.


Go to: http://ccg.vital-it.ch/chipseq/chip_cor.php



 Upload Custom Data

Format : SGA

File : stat1_t25_hg19.sga

Genomes : H. sapiens (Feb 2009/hg19)


Strand : any



Range : -5000 to 5000


Window width : 10

Count Cut-off : 1

Normalization : global



Genome : H.sapiens (Feb 2009/hg19)

Data type : ENCODE ChIP-seq

Series : GSE29611, Histone

Modifications by ChIP-seq

Sample : Hela-S3 H3K4me3

Strand : any

Centering : 70


The results are shown below:

Save the results in TEXT format.

If we repeat the same procedure for samples ‘HeLa-S3 H3K17ac’ and ‘HeLa-S3 H3K27me3’ we get the following results:

We see that STAT1 peaks fall into regions of about 500 base-pairs which are 15-fold enriched in H3K27ac compared to the background. A 7-fold enrichment is observed with the promoter mark H3K4me3 and no enrichment is seen for the H3K27me3 on STAT1. This result suggests that STAT1 binds primarily to regions that are already in an active chromatin state before interferon induction. The valley at position zero probably reflects nucleosome-free regions.

We may wonder whether these STAT1 bound enhancers are also active in other cell types.

We decide to compare H3K27ac marks in HeLa cells along with three other cell types: embryonic stem cells, a lymphoblastoid cell line, and the cancer-derived K562 cell line.

Repeat the step-by-step procedure as described before for samples: ‘Hela-S3 H3K27me3’,

‘H1-hESC H3K27ac’, ‘GM12878 H3K27me3’, and ‘K562 H3K27me3’.

Here is the aggregation plot:

We see an approximately two-fold higher enrichment in HeLa compared to the other cell types, suggesting a substantial degree of tissue-specificity of STAT1-bound regulatory regions.

Next, we can also explore DNase I hypersensitivity, sequence conservation, and population variation data near STAT1 sites.

To this end, we proceed as follows. step-by-step procedure





Format : SGA




Strand : any



Range : -1000 to 1000


Window width : 10

Count Cut-off : 1





Data type : ENCODE DNase FAIRE etc.

Series : Thurman 2012, DNaseI

Hypersensitivity by Digital DNaseI from ENCODE/OpenChrom

Sample : DNaseI HS – H1-hESC – None

– Rep1

Strand : any


Save the results in TEXT format ad do the same for samples: ‘DNaseI HS – GM12878 –

None- Rep1’, ‘DNaseI HS – HeLa-S3 – None- Rep1’, and ‘DNaseI HS – K562 – None-

Rep1’. Note that DNase data do not require centering.

The aggregation plot for DNaseI hypersensitivity is the following:

In summary, STAT1 peaks occur preferentially within DNase hypersensitivity regions of about 200 bp.

For sequence conservation and population variation studies, you can proceed as follows.






Format : SGA




Strand : any



Range : -1000 to 1000


Window width : 10

Count Cut-off : 10


Save the results in TEXT format and proceed as follows:




Data type : Sequence-derived

Series : Vertebrate Conservation

(phastCons46way)

Sample : PHASTCONS VERT46

Strand : any




Format : SGA




Strand : any



Range : -1000 to 1000


Window width : 50





Series : SNP collection from 1000 genome project

Sample : Common SNPs

Strand : any


Count Cut-off : 10


Save the results in TEXT format and do the same for sample ‘All Indels’.

The results are shown here below:

Increased cross-species conservation is observed in a larger region of about 300 bp.

Consistently with this observation, we see proportional depletion of indel variations, and to a lesser extent of common SNPs.

7. High resolution aggregation plots for bound PWM matches.

According to the motif occurrence analysis (first Figure of paragraph 4), our peak list has a positional precision of +/- 50 bp as indicated by the average peak width. Aggregation plots of potentially higher resolutions could be obtained by using the in vivo occupied binding motif locations as anchor point rather than the peak center positions inferred from ChIP-Seq data.

The SSA program FindM allows for generating such a list of in vivo occupied motifs. We upload the peak list obtained with threshold 25 (stat1_t25_rmsk_hg19.fps) to the FindM web form. Then we paste the MEME-generated PWM in the text area reserved for this purpose and search for motifs matches within peak regions from -60 bp to +60 bp relative to the peak center. To generate a random control set, we also collect an approximately equal number of

PWM matches from far outside the peak region (+10000 to +12000).


Go to:

• http://ccg.vital-it.ch/ssa/findm.php

Fill out the form as follows :



stat1_t25_rmsk_hg19.fps

FPS name : Custom FPS

FPS type: Unknown

Sequence Range



Sequence Selection and Search Criteria

Search mode : forward

Sequence Selection mode :

best matches

or

multiple non-overlapping matches


Format: Integer PWM



Cut-off : p-value

Value: 0.0001

Name : custom PWM


Sequence Output Format

FPS: checked

After submission, download and save the list of STAT1 occupied binding motifs

(stat1_t25_rmsk_hg19_sites.fps).

For obtaining the control list (stat1_t25_rmsk_hg19_control.fps), repeat the above procedure changing the Sequence Range to 5’border: +10000, 3’border: +12000.

To study sequence conservation using PhyloP base-wise conservation scores across STAT1 in-vivo occupied binding sites, proceed as follows:





Format : FPS

File : stat1_t25_rmsk_hg19_sites.fps



Strand : any







Series : phyloP base-wise conservation

Sample : PhyloP vertebrate 46way

(score>=2)

Strand : any


Range : -1000 to 1000


Window width : 10

Count Cut-off : 10


Save the results in TEXT format and repeat the analysis for the previous peak list

(stat1_t25_rmsk_hg19.fps), and for the control list (stat1_t25_rmsk_hg19_control.fps).

Results are shown here below:

The Figure shows single base resolution plots for sequence conservation (PhyloP scores). We can zoom in on the binding motif region to better characterize single base conservation within motif regions.

To this end, go to: http://ccg.vital-it.ch/chipseq/chip_cor.php

and fill out the form as follows:



Format : FPS

File : stat1_t25_rmsk_hg19_sites.fps







Series : phyloP base-wise conservation

Sample : PhyloP vertebrate 46way

Strand : any



Range : -12 to 12


Window width : 1

Count Cut-off : 10


(score>=2)

Strand : any


Save the results in TEXT format and to the same for the control list.

Results are shown below:

We see increased sequence conservation within the 9 bp regions that make up the STAT1 binding motif. As expected, the center position, which is essentially unconstrained, is not more conserved than the flanking regions. The degree of sequence conservation is generally much lower for the non-occupied sites.

Lists of occupied motifs rather than peak centers are also useful to investigate interactions between sequence motifs. This can be done using OProf to generate a motif occurrence profiles of STAT1 PWM matches downstream of in vivo occupied STAT1 sites.

Here is the step-by-step procedure .

Go to OProf via the link below and fill out the form as shown below:




stat1_t25_rmsk_hg19_sites.fps

FPS name : STAT1 ini-vivo occupied motifs

FPS type: BS

Sequence Range


5'border : 0 3'border : 100


Window size : 13

Window shift : 1



Format: Integer PWM



Cut-off : p-value

Value: 0.001

Name : custom PWM


We save the results in Text format and repeat the same analysis for the control set.

Here are the results:

We see a narrow peak (±2bp) centered 21 bp downstream of the reference site which is absent in plot generated with the control set. This previously observed preferential occurrence of two in vivo occupied STAT1 sites at a center-to-center distance of two helical turns may be related to a tetrameric binding mode documented for some members of the STAT family (Schmid and Bucher, 2010 ).

Microsoft Word Document - Computational Cancer Genomics - Vital-IT

Related documents

Products

Support

Microsoft Word Document - Computational Cancer Genomics - Vital-IT

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib