Chapter 1 Analysis of ChIP-Seq Data with Partek Genomics Suite™ 6.6

Overview

ChIP-Sequencing technology (ChIP-Seq) uses high-throughput DNA sequencing to map protein-DNA interactions across the entire genome. Partek® Genomics Suite™ (PGS) offers convenient visualization and analysis of the high volumes of data generated by ChIP-Seq.

In this tutorial, you will go through the PGS ChIP-Seq workflow and will analyze aligned data from a ChIP sample versus a control sample in .bam format.

This tutorial will illustrate how to

 Import ChIP-Seq data

 Perform QA/QC of the samples

 Detect and visualize peaks and enriched regions in the genome

 Discover binding site motifs

 Annotate enriched regions with overlapping genes

 Visualize mapped sequence reads on the genome

Note: the workflow described is specific to PGS version 6.6. To upgrade to this version, go to the Main menu and select Help > Check for Updates. The screenshots shown below may vary slightly across hardware platforms and across different versions of PGS.

Description of the Data Set

The data for this tutorial come from Johnson et al. (2007), a study that maps the genomic binding sites of the NRSF (neuron-restrictive silencer factor) transcription factor across the entire genome. The data set includes two samples: an NRSF-enriched ChIP sample (chip.bam) and a control sample without immuno-enrichment (mock.bam). The chip.bam file contains almost 1.7 million mapped reads, and the mock.bam file contains approximately 2.3 million mapped reads. These bam files contain the aligned genomic locations and sequences of the mappable reads. This dataset contains reads from a single-end (SE) library; the differences in processing paired-end (PE) reads will also be discussed when applicable.

Data and associated files for this tutorial can be downloaded from the Next Generation Sequencing tab on Help > On-line Tutorials from the PGS main menu.

Analysis of ChIP-Seq Data with Partek ® Genomics Suite 6.6™ 1

Import Instructions

The steps below will briefly describe how to import the mapped reads of ChIP-Seq data into PGS.

Step 1 – Download the Data

 Download and unzip the tutorial data, and save the bam files on your computer. Due to the large file sizes associated with NGS data, it is recommended that bam files be accessed locally (not across the network). The first time a bam file is read in by PGS, the file will be sorted to allow faster access; therefore, you must have write permission on the bam files and in the bam file folder.
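The write-permission requirement above can be checked before starting the import. The following is a minimal sketch that uses only the Python standard library; the function name check_bam_writable is hypothetical and is not part of PGS.

```python
import os


def check_bam_writable(bam_paths):
    """Return a list of (path, problem) tuples; an empty list means all checks passed."""
    problems = []
    for path in bam_paths:
        folder = os.path.dirname(os.path.abspath(path))
        if not os.path.isfile(path):
            problems.append((path, "file not found"))
        elif not os.access(path, os.W_OK):
            problems.append((path, "file is not writable"))
        elif not os.access(folder, os.W_OK):
            problems.append((folder, "folder is not writable"))
    return problems


if __name__ == "__main__":
    # File names from this tutorial; adjust the paths to where you saved them.
    for path, problem in check_bam_writable(["chip.bam", "mock.bam"]):
        print(f"{path}: {problem}")
```

If the script prints nothing, both files and their folder are writable by the current user.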

Step 2 - Import Mapped Reads into PGS

 Open the ChIP-Seq workflow within PGS by selecting it from the Workflows drop-down in the upper right corner of the menu

 Under Import from the ChIP-Seq workflow, select Import and manage samples to invoke the Sequence Import wizard

 Using the file browser on the left, navigate to the ChIP-Seq_Data folder containing the bam files. For this tutorial, select chip.bam and mock.bam (Figure 1). Select OK

Figure 1: Selecting ChIP-Seq files. Date modified may be different than what is shown

 In the Sequence Import dialog, specify the Output file, Species, and Genome build. For this tutorial, set Species to Homo sapiens and Genome build to hg18. The Output file will be the name of the parent spreadsheet. Select OK

The Bam Sample Manager (Figure 2) can be used to add new samples or files to the project (Add samples), to remove samples (Remove selected samples), to associate (multiple) files with particular samples (Manage samples), and to map the chromosome names from the input files to the annotation files (Manage sequence names). Since none of these operations are needed, select Close. If the bam file has not been sorted previously by PGS, you may see the Sort bam files dialog; select OK to sort the files if this dialog box appears. While the files are being sorted, you will see a message in the status bar at the bottom of the window.

Figure 2: Bam Sample Manager used to add or remove additional bam files to the experiment

The resulting spreadsheet is shown in Figure 3. Each sample will be on one row. The number of aligned reads per sample is shown in column 2. The import process is now finished.

Figure 3: Viewing the spreadsheet after import. Each row contains a sample

Quality Control of Samples

In addition to any quality control that may have been performed when the data was sequenced, it is a good idea to check the quality of the samples using PGS before analyzing the data.

Examining the Distribution of Reads

BAM files can contain both aligned and unaligned reads. The top-level spreadsheet in Figure 3 shows the number of reads that were aligned to the reference genome. A large number of unaligned reads may be the result of poor-quality sequence data or alignment problems (wrong genome, alignment settings, etc.). You might also be interested in knowing how many reads map to more than one location in the genome (if the aligner options supported multiple-mapped reads).

 In the QA/QC section of the ChIP-Seq workflow, select Alignments per read

 A new spreadsheet called Alignment_Counts is generated (Figure 4). The titles of columns 2 and 3 indicate that this is single-end data. Column 2 shows the number of unaligned reads (0 alignments per read), and column 3 shows the number of reads that align exactly once to the genome (1 alignment per read). If the BAM files had contained reads that mapped to more than one location in the genome, these would be shown after column 3
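The tally behind an alignments-per-read table can be sketched in a few lines. The example below is only an illustration of the counting logic, not PGS's implementation; it assumes the BAM records have already been reduced to hypothetical (read_name, is_mapped) pairs, one pair per record.

```python
from collections import Counter


def alignments_per_read(records):
    """Map 'number of alignments' to 'number of reads with that many alignments'.

    records: iterable of (read_name, is_mapped) pairs, one per BAM record
    (a hypothetical, simplified input format).
    """
    hits = Counter()
    for name, is_mapped in records:
        # An unmapped record still registers the read, with zero alignments.
        hits[name] += 1 if is_mapped else 0
    return Counter(hits.values())


# Made-up records: r1 aligns once, r2 is unaligned, r3 aligns to two locations.
example = [("r1", True), ("r2", False), ("r3", True), ("r3", True)]
summary = alignments_per_read(example)
for n_alignments in sorted(summary):
    print(n_alignments, "alignment(s):", summary[n_alignments], "read(s)")
```

Each key of the result corresponds to one column of the table (0 alignments per read, 1 alignment per read, and so on).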

Figure 4: Alignment_Counts spreadsheet. The unaligned reads had been removed from these BAM files and the alignment options did not permit more than one mapping location per read

Strand Cross-Correlation

In short-read ChIP-Seq data, peaks are found upstream of the actual DNA-binding site (upstream on both strands). In a good quality ChIP-Seq sample, the peaks on the forward strand and the reverse strand are offset (phase-shifted) by the size of the “effective fragment length.” The effective fragment length tends to be shorter than the length of the fragmented DNA, the length of the size selection, and the pull-down length. Strand Cross-Correlation calculates the correlation of the strand-specific read densities; the maximum correlation should occur at the average size of the peak shift across all chromosomes.

For single-end reads, PGS will calculate the phase shift between the reads on the forward strand and reads on the reverse strand using the method (Pearson cross-correlation) described by Kharchenko et al. (2008). Note: the estimation of effective fragment length for single-end reads can only be done on IP samples and not on mock controls since non-enriched samples do not contain a phase shift. For paired-end reads, Strand Cross-Correlation is calculated from the distribution of fragment lengths spanned by the two reads of each pair.
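The idea of shifting one strand against the other and looking for the best correlation can be illustrated with a small, simplified sketch. This is not PGS's or Kharchenko et al.'s implementation: it works on one short synthetic "chromosome", counts the 5' end positions of reads per base, and the positions used are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def strand_cross_correlation(fwd_ends, rev_ends, length, max_shift):
    """Correlate forward density at position i with reverse density at i + shift."""
    fwd = [0] * length
    rev = [0] * length
    for pos in fwd_ends:
        fwd[pos] += 1
    for pos in rev_ends:
        rev[pos] += 1
    return {
        shift: pearson(fwd[: length - shift], rev[shift:])
        for shift in range(max_shift + 1)
    }


def estimate_fragment_length(cross_correlation):
    """Shift with the highest correlation."""
    return max(cross_correlation, key=cross_correlation.get)


# Synthetic data: every reverse-strand 5' end lies 111 bp downstream of a
# forward-strand 5' end, mimicking the phase shift of an IP sample.
forward = [145, 145, 146, 445, 446, 745]
reverse = [p + 111 for p in forward]
cc = strand_cross_correlation(forward, reverse, length=1000, max_shift=200)
print("estimated effective fragment length:", estimate_fragment_length(cc))
```

On real data the profile would be computed per chromosome and combined, and a mock control would be expected to show no comparable maximum.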

 Under QA/QC from the ChIP-Seq Workflow , select Strand Cross-Correlation . If you have not run this step previously, you will be asked if you would like to create a new QA/QC child spreadsheet. If prompted, select Yes

 After running Strand Cross-Correlation from the QA/QC workflow, the Strand Separation of Samples viewer will appear (Figure 5)

Figure 5: Viewing the Strand Cross-Correlation plot to estimate effective fragment length

In Figure 5, the x-axis represents the phase-shift, and the y-axis represents the Pearson correlation of the strand densities of the forward and reverse strands. Notice in the IP sample, the peak occurs at 111 bp, corresponding to an average effective fragment length of 111 base pairs. The peak location can be determined by examining the values in the strand_correlation spreadsheet, by mousing over the peak in the graph, or by sorting the data in the spreadsheet.

The control sample (blue) does not have a similar peak because it does not have the phase-shift property of IP samples. The control sample does have a small peak at 26 bp, which corresponds to the sequencing read length. This is probably because some regions in the genome of the control sample contain many reads stacked up on each other, which creates a correlation peak when the forward and reverse strands are shifted by the length of the reads. At the sequencing read length, the IP sample will show a strand cross-correlation near 0.

The location and magnitude of the peaks in the cross-correlation plot can be used as a measure of the quality of the enriched sample. Figure 5 shows a highly enriched sample because the peak at 111 bp dominates the peak at the read length. If the dominant peak in the IP-enriched sample occurs at the read length, the sample was poorly enriched or contains very few binding sites. The plot in Figure 6 shows two IP samples with medium-level enrichment. Multiple dominant peaks in the IP sample may indicate that there are several populations of DNA fragment lengths, which will complicate peak calling (Kundaje 2010).

Figure 6: Example of medium-level enriched samples

Detecting Peaks and Enriched Regions

Regions that contain a binding site for the DNA-binding protein of interest will have many sequence reads mapped to them. Since single-end reads only cover one end of a sequence fragment, enriched regions will generally show two adjacent peaks. PGS will directionally extend each SE read in the 3’ direction by the fragment length (extended reads) to facilitate merging adjacent peaks into a single peak. For PE reads, the fragment length is defined from the start of the 5’ end of the first read through the 3’ end of its paired read. For peak detection, PGS divides the genome into windows (bins) of a user-defined size and counts the number of (midpoints of) the reads that fall within each bin. PGS fits a zero-truncated negative binomial to the bin counts and finds all regions that are significant at a user-defined false discovery rate (FDR). See the ChIP-Seq white paper for more information on the peak-finding algorithm and tips for setting the Fragment extension and window sizes.
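The read extension and binning steps can be sketched as follows. This is a simplified illustration under stated assumptions, not PGS's code: reads are hypothetical (start, read_length, strand) tuples with 0-based leftmost coordinates on a single chromosome, and the statistical model (the zero-truncated negative binomial fit and the FDR cut-off) is deliberately left out.

```python
from collections import Counter


def extended_midpoint(start, read_len, strand, fragment_len):
    """Midpoint of a single-end read after extension in its 3' direction."""
    if strand == "+":
        frag_start = start
        frag_end = start + fragment_len
    else:
        # A reverse-strand read is extended toward lower coordinates.
        frag_end = start + read_len
        frag_start = frag_end - fragment_len
    return (frag_start + frag_end) // 2


def bin_counts(reads, fragment_len, window):
    """Count extended-read midpoints per window; keys are window indices."""
    counts = Counter()
    for start, read_len, strand in reads:
        midpoint = extended_midpoint(start, read_len, strand, fragment_len)
        counts[midpoint // window] += 1
    return counts


# Made-up reads: a forward and a reverse read flanking the same site,
# plus one isolated forward read elsewhere.
reads = [(1000, 26, "+"), (1084, 26, "-"), (5000, 26, "+")]
counts = bin_counts(reads, fragment_len=110, window=110)
for index in sorted(counts):
    print(f"window {index * 110}-{(index + 1) * 110}: {counts[index]} read(s)")
```

After extension, the forward and reverse reads that flank the same site fall into the same window, which is why extension helps merge the two adjacent strand-specific peaks into one.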

 Under Peak Analysis of the ChIP-Seq workflow, select Detect peaks. The Detect peaks dialog will appear (Figure 7)

 Specify the Fragment Extensions by setting the Maximum average fragment size to 110. Maximum average fragment size is based on your experimental design: the size of the fragment pulled down in the immunoprecipitation step, the size used during DNA fragmentation, the fragment length used for size selection, or the effective fragment length. If you have used an antibody that binds the DNA as the control antibody (rather than no enrichment as the control), you could use different fragment lengths for the two samples with the Individual maximum fragment sizes radio button. For experiments using a mock control (no enrichment), use Maximum average fragment size

 As this example uses the mock sample as the reference, select mock in the dropdown list under Reference sample

 The peak detection algorithm will divide the genome into windows and find windows that are enriched with reads based on the FDR value. Set the Window Size (base pairs) to 110

 Peak Cut-off FDR determines the cut-off for the significance of peaks in the chip sample. Lower cut-off values imply greater differences between the chip and mock peaks; higher cut-offs lessen the difference in peak heights between the chip and mock samples. Set the Peak Cut-off FDR to 1 false positive in 1000 (0.001)

 Leave the remaining parameters with the default values and select OK

Note: As transcription factor binding sites tend to have localized and sharp clusters of reads, the window size used during the analysis of a transcription factor study can be left relatively small (approximately the same as the average fragment length), and the option to allow for gaps between enriched windows need not be used. In that case, in the Results reporting section, the Region in the window with most reads could also be selected. Histone modification peaks, on the other hand, tend to be subtle, diffuse, and spread out. For that type of analysis, larger windows might be more suitable, and neighboring windows may be combined (Within a gap distance of option) into larger windows (under Window size and Results reporting, respectively). The exact settings depend on the data and the experiment design, so fine tuning is recommended.

The More info link at the top of the dialog box displays a figure which demonstrates the relationship between window and gap size. Try changing the How should windows be merged or the Which regions should be reported? options; the blue bar underneath each figure will reflect how regions are detected and reported with these settings.

Figure 7: Configuring the Peak Detection dialog

Figure 8: Viewing the detected peaks in the samples

The resulting spreadsheet (Figure 8) will appear. The spreadsheet is sorted by chromosome number and genomic location. Each row represents one genomic region of peak enrichment. The columns are:

1. Chromosome : Chromosome of region

2. Start : Start of region (inclusive)

3. Stop : End of region (exclusive)

4. Sample ID : The sample containing the enriched region

5. Interval Length : length of region, i.e., Stop – Start, in base pairs

6. Maximum Extended Reads in Window : The greatest number of (extended) reads in one of the windows of a ChIP-Seq region

7. Reads per Million (RPM) : Column 6 divided by the total number of aligned reads in the sample (in millions). This column will help you compare peaks across samples, especially when there is a large difference in the number of aligned reads between samples

8. Mann-Whitney p-value : Identifies separation between forward and reverse peaks for single-end reads using the Mann-Whitney U-test. Lower p-values indicate better separation. This p-value can be used when there was no control sample or to eliminate reads due to PCR bias

9.-10. Total reads in region : Total number of (non-extended) reads for each sample (chip and mock, respectively) in the given genomic region

11. p-value(Sample ID vs. mock) : Compares each sample to the reference (mock in this example) using a one-tailed binomial test. A low p-value means there are significantly more reads in the sample specified in column 4 (that is, for each region) than in the mock sample. This column is only included if a reference sample is specified in the Peak Detection dialog (Figure 7)

12. scaled fold change (Sample ID vs. mock) : Compares intensity of signal between each sample (specified in column 4) to the reference sample (mock in this example). The fold change is scaled by a ratio of the number of reads for each sample (IP vs. control) on a per-chromosome basis. Scaled fold changes > 1 indicate more enrichment in the IP sample than in the control sample. This column is only included if a reference sample is specified in the Peak Detection dialog

13.-14. <Sample> overlap percent : Fraction of called region that overlaps a region from the given sample where <Sample> is the name of the Sample ID in column 4. For example, the values of 100% in column 13 and 0% in column 14 point to regions detected in the chip sample, but not in the mock sample. Similarly, regions with the value of 100% in column 14 were detected in the mock sample (and thus might be excluded from downstream analyses)
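Three of the columns above follow from simple arithmetic, shown here as a minimal sketch. The function names are hypothetical, the example values are made up, and regions are treated as half-open (start inclusive, stop exclusive) intervals on one chromosome; PGS may compute these values differently in detail.

```python
def interval_length(start, stop):
    """Column 5: Stop minus Start, in base pairs."""
    return stop - start


def reads_per_million(max_extended_reads, total_aligned_reads):
    """Column 7: column 6 divided by the sample's aligned reads in millions."""
    return max_extended_reads / (total_aligned_reads / 1_000_000)


def overlap_percent(region, sample_regions):
    """Columns 13-14: percent of `region` covered by regions called in a sample.

    Assumes the regions in `sample_regions` do not overlap each other.
    """
    start, stop = region
    covered = 0
    for other_start, other_stop in sample_regions:
        covered += max(0, min(stop, other_stop) - max(start, other_start))
    return 100.0 * covered / (stop - start)


# Made-up example: a 220 bp region whose tallest window holds 85 extended
# reads, in a sample with 1.7 million aligned reads.
print(interval_length(1000, 1220))
print(round(reads_per_million(85, 1_700_000), 2))
print(overlap_percent((1000, 1220), [(1110, 1500)]))
```

Dividing by the library size is what makes the RPM column comparable between samples that differ in the number of aligned reads, such as the chip and mock samples in this tutorial.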

Create a list of enriched regions

You have just created a list of peaks found in both samples. In this section, you will create a list that filters out peaks detected in the chip sample that also occur in the control (mock) sample. This list will be used to search for motif binding sites.

 Under Peak Analysis of the ChIP-Seq workflow, select Create a list of enriched regions . The regions found in the IP sample that do not have many reads in the control sample are of most interest. Use the List Creator functions to filter out regions that have a high number of reads in the control (mock) sample by using the p-value against the control

 Select Specify New Criteria. Give the new criteria a name such as p-value filtered, select the 1/regions (peaks) Spreadsheet, and choose Column 11. p-value(Sample ID vs. mock). To keep the regions in which the sample has significantly more reads than the control, choose to include p-values that are significant with FDR of 0.05. The dialog should look like Figure 9. Select OK

Figure 9: Configure criteria dialog to filter out peaks that occur in the control sample

 Before closing the List Creator dialog, Save the list you just created. The spreadsheet should have 2473 rows. The resulting regions are those that have significantly more reads in the chip sample than in the mock sample. Select Close to exit the dialog
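Filtering a column of p-values by a false discovery rate can be sketched as below. The tutorial does not state which FDR procedure the List Creator applies; this example assumes the Benjamini-Hochberg step-up procedure, and the p-values are made up.

```python
def bh_significant(pvalues, fdr=0.05):
    """Flag p-values significant under Benjamini-Hochberg at the given FDR.

    Assumption: the step-up procedure is used; PGS may use a different variant.
    Returns a list of booleans in the same order as the input.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Largest rank whose p-value is at or below its step-up threshold.
        if pvalues[index] <= fdr * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            keep[index] = True
    return keep


# Made-up p-values for five regions.
region_pvalues = [0.001, 0.2, 0.012, 0.04, 0.9]
print(bh_significant(region_pvalues, fdr=0.05))
```

Note that a p-value below 0.05 (0.04 in the example) is not necessarily retained: the FDR criterion is stricter than an uncorrected 0.05 cut-off when many regions are tested.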

Other List Creator operations (Figure 10) like the Venn Diagram and Union (Or) or Intersection (And) of the lists could also be performed to create a list of “true” enriched peaks. For instance, you could filter on the intersection between FDR and Peaks not in mock or you may choose to filter by scaled fold change or apply a minimum number of reads per million (RPM). The choice of how to create a list of “true” peaks is up to you and may be different for different kinds of experimental designs.

Figure 10: List Creator commands

de novo Motif Discovery and Motif Search

Now that you have a list of enriched regions, you will learn how to find recurring patterns or motifs in these regions. A transcription factor can bind to many sites throughout the genome. These sites usually share a certain pattern in their sequences (consensus sequence). By searching for these binding site motifs, you can determine the binding site pattern and the locations of binding in the genome. PGS detects de novo motifs using the Gibbs motif sampler (Neuwald et al., 1995).

A database of known transcription factor motifs, such as JASPAR (http://jaspar.cgb.ki.se/), can be searched, or de novo motifs may be identified using only the sequences from the identified regions.

Step 1 – de novo Motif Discovery

 Under Peak Analysis , select Motif discovery . The two options for motif discovery, Discover de novo motifs and Search for known motifs , will be discussed separately

 Select Discover de novo motifs and OK

 Choose 1/p-value_filtered as the Spreadsheet with genomic regions. Use the default settings: Number of Motifs 1, Discover motifs of length between 6 and 16 base pairs, and Result file: Motifs. Select OK. If the reference genome has not been previously downloaded onto this computer, you may be asked if you would like to download the .2bit reference genome. If prompted and PGS is able to connect to the Internet, select Automatically download a .2bit file and OK. If you do not have an Internet connection, choose one of the other two options (Manually specify a .2bit file or Create a .2bit file from reference fasta files). The reference genome is required for determining which genes overlap the enriched peak regions and for displaying the aligned sequences

A motif visualization plot (Figure 11) and two spreadsheets will be generated. One spreadsheet, motifs (Motifs), contains information about the motif, and the other, instances (Motifs_instances.txt), lists the genomic locations of the motif. If your motif does not look exactly like Figure 11, select the Reverse button, which will give you the reverse complement of the motif.
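A motif and its reverse complement describe the same binding site read from opposite strands. As a small illustration of what the Reverse button does to a consensus string, the sketch below reverses the sequence and swaps complementary bases; dots (ambiguous positions) are left unchanged. The function name is hypothetical.

```python
# Translation table for complementary bases; "." stays ".".
COMPLEMENT = str.maketrans("ACGT.", "TGCA.")


def reverse_complement(consensus):
    """Reverse complement of a consensus string made of A, C, G, T and dots."""
    return consensus.translate(COMPLEMENT)[::-1]


# Consensus reported for the NRSF motif in this tutorial.
print(reverse_complement("CAG.ACC..GGA.AG"))
```

Applying the function twice returns the original string, so either orientation can be recovered from the other.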

Figure 11: Viewing the binding site motif for NRSF. Use the yellow arrows in the upper right to cycle through views of all the motifs found (if more than one was found)

Description of Motif Output

Sequence Logo window

The Sequence Logo window (Figure 11) graphically displays the best motif found in the peak regions of the data. In this case, the motif finder discovered a motif in the NRSF-enriched regions that is 15 base pairs in length. The height of each position is the relative entropy (in bits) and indicates the importance of a base at a particular location in the binding site. The title CAG.ACC..GGA.AG is the consensus sequence for the sequence logo. Dots represent positions that contain more than one base across all reads in the motif.

The dots can be replaced with letters by checking the Show nucleotide codes checkbox; doing so will give characters representing the possible bases at that position. For a description of the IUPAC nucleotide codes, please visit: http://www.bioinformatics.org/sms/iupac.html.
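The height of a logo position and the dotted consensus can both be derived from per-position base counts. The sketch below is illustrative only: it uses the standard relative entropy formula, sum over bases of f × log2(f / q) with observed frequency f and background frequency q, assumes a uniform background of 0.25 per base, and takes the tutorial's description of dots literally (a dot wherever more than one base was observed). The count matrix is made up.

```python
from math import log2


def relative_entropy(counts, background):
    """Relative entropy (bits) of one motif position against a background."""
    total = sum(counts.values())
    bits = 0.0
    for base, n in counts.items():
        if n > 0:
            freq = n / total
            bits += freq * log2(freq / background[base])
    return bits


def consensus(count_matrix):
    """One letter where a single base was observed, a dot otherwise."""
    letters = []
    for counts in count_matrix:
        present = [base for base, n in counts.items() if n > 0]
        letters.append(present[0] if len(present) == 1 else ".")
    return "".join(letters)


# Assumed uniform background and a made-up two-position count matrix.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
matrix = [
    {"A": 0, "C": 10, "G": 0, "T": 0},  # fully conserved position
    {"A": 5, "C": 5, "G": 0, "T": 0},   # two bases, equally frequent
]
for position in matrix:
    print(relative_entropy(position, uniform))
print(consensus(matrix))
```

With a uniform background, a fully conserved position reaches the maximum of 2 bits, and a position split evenly between two bases carries 1 bit.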

Motifs spreadsheet

The motif information spreadsheet (Figure 12), entitled Motifs, lists the information about the motif that was visualized using the sequence logo. This includes the Counts of bases in each position of the pattern (column 1), the Consensus Sequence (column 2), the Motif ID (column 3), the Log Likelihood Ratio of the motif (column 4), and the Background frequency of each of the bases in all of the sequences of that motif. The Log Likelihood Ratio scores the relative likelihood that the found pattern did not occur by chance.

Figure 12: Viewing the motif spreadsheet

You can (re)display the Sequence Logo of the motif by right-clicking on a row header and selecting Logo View. If more than one motif was found (in the de novo motif dialog, you only requested one motif to be found), then the yellow arrows shown in Figure 11 may be used to cycle through the motifs.

Motif_instances spreadsheet

The Motif_instances spreadsheet (Figure 13), a child of the Motifs spreadsheet and entitled instances, details all of the locations of the motif(s) in the enriched regions. Each row lists a putative binding site for a motif. The genomic location is given (chromosome, start, end, and strand), along with the Motif ID, the sequence found at that location, and a score of how likely that site is part of the motif. The list is sorted in order of descending score. The larger the score, the more likely the site is a true instance of the motif.

Figure 13: Viewing the motif instances spreadsheet

Step 2 – Search JASPAR for Known Motifs

 Repeat the Motif discovery step; however, select the Search for known motifs radio button and OK. This will search the JASPAR database for motifs that are over-represented (more than by chance) in the list of sequences in the significant regions list. The JASPAR database will download automatically if needed during the Search for known motifs step. Downloading the JASPAR database will create a spreadsheet in your experiment named JASPAR.txt that contains all of the species-specific motifs in the database. Visualization of the motifs is done by right-clicking on a row in the JASPAR.txt spreadsheet and selecting Logo View. The yellow arrows in the upper right corner may be used to cycle through visualization of the motifs in the JASPAR database

 The motif search should be performed on the p-value_filtered list. You may search for a particular element in the database or all of the elements in the database. For this tutorial, use the defaults and search for all of the motifs listed in the JASPAR database (Figure 14). Select OK

Alternatively, you can search the list of sequences for a single motif specified by a valid nucleotide sequence (Search for motif), or, if you want to look for several motifs, you can import them as a list in a tab-delimited file (Import motifs from text file).

This feature may also be used to import motifs from other databases to which you have access (TRANSFAC®, custom database, etc.). Use the help button for specification of the format of the text file. Sequence Quality value is a number between 0 and 1 and indicates how closely a sequence must match the pattern for it to be called an instance of the pattern. The higher the value, the closer it must match the pattern to be called.
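The effect of a match threshold between 0 and 1 can be illustrated with a toy scanner. This is not how PGS scores motif instances (the tutorial does not describe the scoring); here the score of a window is simply assumed to be the fraction of non-dot pattern positions that match exactly, and the sequence and pattern are made up.

```python
def match_fraction(window, pattern):
    """Fraction of informative (non-dot) pattern positions matched by window."""
    informative = [(w, p) for w, p in zip(window, pattern) if p != "."]
    if not informative:
        return 1.0
    hits = sum(1 for w, p in informative if w == p)
    return hits / len(informative)


def find_instances(sequence, pattern, quality):
    """Return (offset, window) for every window scoring at least `quality`."""
    width = len(pattern)
    found = []
    for offset in range(len(sequence) - width + 1):
        window = sequence[offset:offset + width]
        if match_fraction(window, pattern) >= quality:
            found.append((offset, window))
    return found


# Made-up sequence and pattern; a quality of 1.0 demands a perfect match.
print(find_instances("TTCAGTACCAA", "CAG.ACC", quality=1.0))
```

Lowering the quality value admits windows with more mismatches, so the number of reported instances can only stay the same or grow.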

Figure 14: Search for JASPAR Motifs in Sequences dialog

 Two resulting spreadsheets, similar to the spreadsheets in the de novo motif discovery step, will be generated: the motif_summary (MotifSearch) spreadsheet (Figure 15) and the motif_instances (MotifSearch.instance) spreadsheet

 Sort the motif_summary spreadsheet by p-value by right-clicking on the p-value column and selecting Sort Ascending

Figure 15: Motif_summary spreadsheet. Each motif from the JASPAR database (or other input database used) will be shown. Probability of Occurrence (column 2) is the probability of detecting a false positive for this motif in a random DNA sequence. Expected Number of Occurrences (column 3) is the Probability of Occurrence times the total length of the reads. Actual Number of Occurrences (column 4) is the count of sequences that match the known motif in the reads. P-value (column 5) is the uncorrected p-value (binomial test)
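The relationship between the columns in Figure 15 can be sketched as follows. The expected count is the probability of occurrence multiplied by the total sequence length, as the caption states; the p-value is computed here as an upper-tail binomial probability, which is an assumption about the exact form of the test. The probability and total length in the example are hypothetical placeholders, not values from the tutorial data.

```python
from math import exp, lgamma, log


def expected_occurrences(probability, total_length):
    """Column 3: probability of occurrence times total sequence length."""
    return probability * total_length


def log_binom_pmf(i, n, p):
    """Natural log of the Binomial(n, p) probability of exactly i successes."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = exp(log_binom_pmf(i, n, p))
        total += term
        # Beyond the mean the terms only shrink; stop once they are negligible.
        if i > n * p and term <= total * 1e-17:
            break
    return min(total, 1.0)


# Hypothetical inputs: probability 1e-6 per position, 500,000 bases scanned,
# and an observed count far above the expected count.
prob, length, observed = 1e-6, 500_000, 1071
print("expected:", expected_occurrences(prob, length))
print("p-value:", binom_upper_tail(observed, length, prob))
```

When the observed count is orders of magnitude above the expected count, the tail probability underflows double precision and is reported as 0.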

As you can see in Figure 15, REST (another name for NRSF) is at the top of the list. The spreadsheet indicates that the expected number of by-chance occurrences of the NRSF/REST motif is less than 1, but in fact, 1071 occurrences of the motif were observed, resulting in a very low p-value (0). This motif agrees with the motif found in the de novo motif detection step. Interestingly, other motifs appear a significant number of times in the ChIP-Seq peaks and may represent possible co-factors.

The motif_instances spreadsheet contains all instances of the motifs (with actual counts >0) from the motif_summary spreadsheet.

Generating a list of regions containing the REST motif (Optional)

Because the motif_instances spreadsheet contains every instance of every motif identified, you may wish to create a spreadsheet of just the REST instances that contains the locations of each of the 1071 instances of the REST motif.

 Select the motif_instances (MotifSearch.instance) spreadsheet in the spreadsheet navigator

 Select the Motif Name column header (column 5) in the spreadsheet

 Right-click and select Find / Replace / Select as shown in Figure 16

Figure 16: Finding all REST peaks (step 1)

 In the next dialog, at Find What, type in REST and choose Select All at the bottom of the screen. This finds and selects the 1071 instances of the REST motif as shown in Figure 17

Figure 17: Selecting all REST instances in motif_instances spreadsheet (step 2)

 Close the dialog. You will notice that in the original spreadsheet, the focus has shifted so that row 12848 is now highlighted and visible in the view.

 Right-click on row 12848 and select Filter Include (Figure 18)

Figure 18: Including all REST instances that were identified by Find / Replace / Select

 Notice now that the motif_instances spreadsheet has 1071 rows and that a filter has been applied (Figure 19)

Figure 19: Filtered motif_instances spreadsheet contains 1071 REST instances. The black and yellow bar at the far right shows that a filter has been applied to this spreadsheet

Filters are very powerful but will slow down spreadsheet operations on the original list. Furthermore, the filter operation does not create a brand new spreadsheet. In order to create a spreadsheet that only contains the REST instances, it is necessary to clone the original spreadsheet with the filter applied, save the clone with a new name, and clear the filter from the original spreadsheet.

 Right-click on motif_instances in the spreadsheet navigator and then select Clone…

 In the Clone Spreadsheet dialog, type REST for Name of resulting copy and select 1/p-value_filtered/motif_summary (MotifSearch) from the pull-down menu of Create as a child of spreadsheet

 This creates a new spreadsheet in the spreadsheet navigator that has not been saved (there is an * after the spreadsheet name). Save the spreadsheet by right-clicking on the spreadsheet, selecting Save As…, and typing in REST as the File name

 To remove the filter from the original spreadsheet, right-click on motif_instances (MotifSearch) in the spreadsheet navigator. Notice the yellow/black bar on the right (also shown in Figure 19). Right-click anywhere in the yellow/black bar and select Clear Filter. Now both the original spreadsheet and the REST spreadsheet exist without filters
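The filter, clone, and clear-filter sequence amounts to copying the matching rows into a new table while leaving the original untouched. The sketch below shows the same idea outside PGS, for example on an exported table; the rows, column names, and exact-match rule are assumptions made for illustration.

```python
import copy


def filter_include(rows, column, value):
    """Return independent copies of the rows whose `column` equals `value`."""
    return [copy.deepcopy(row) for row in rows if row[column] == value]


# Made-up rows standing in for a motif instances table.
motif_instances = [
    {"chromosome": "1", "start": 1200, "Motif Name": "REST"},
    {"chromosome": "1", "start": 5400, "Motif Name": "SP1"},
    {"chromosome": "2", "start": 880, "Motif Name": "REST"},
]
rest_only = filter_include(motif_instances, "Motif Name", "REST")
print(len(rest_only), "REST rows;", len(motif_instances), "rows in the original")
```

Because the rows are copied, later edits to the REST list do not affect the full instances table, which mirrors the reason for cloning before clearing the filter.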

Finding Nearest Genomic Features

In this section, you will learn how to find genomic features (genes) that are near the IP-enriched regions of the data. You will also learn how to classify the peak locations by gene section (5’ UTR, 3’ UTR, Promoter, CDS).

Step 1 – Specify the Database

 Make sure the spreadsheet that you want to overlap with genes is active. In this case, you want to detect overlaps on the p-value_filtered spreadsheet, so select the p-value_filtered spreadsheet

 Under Peak Analysis, select Find nearest genomic feature. A dialog, similar to Figure 20, will appear. Select RefSeq Transcripts. A download of the database will be started if this information has not previously been downloaded onto your computer. Leave the promoter region boundaries as default and select OK

Figure 20: Configuring the dialog for finding genes that overlap enriched regions of the data

Step 2 – View the List of Nearest Genomic Features

The resulting spreadsheet (gene-list) (Figure 21) is a child of the p-value_filtered spreadsheet. Each row represents a transcript with Transcript ID (column 5), Gene Symbol (column 6), and genomic location of the transcript (columns 1-3). Distance to TSS (column 7) gives the distance of each enriched region to the transcription start site (in base pairs; positive means downstream and negative means upstream). Overlap with gene and region are given in columns 8 and 9, respectively. Columns 10 and greater were already discussed under the Detecting Peaks and Enriched Regions section.

Note: Percent overlap with gene is more likely to be high (close to 1) in cases where one region covers several genes (for example, histone studies). Percent overlap with region is likely to be high (close to 1) if a region is relatively small and is found completely within a gene (for example, transcription factor binding studies). If both columns are close to 1, then the gene and the region have nearly the same start and stop locations. If both columns are small (close to 0), then the region does not overlap the gene directly and likely covers only the promoter region.

Another way to interpret the percent overlap with region and percent overlap with gene is to use Peak Analysis > Classify regions by gene section. This step is left for you to try on your own (the input should be a region list or filtered region list).
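The distance and overlap columns described above can be reproduced with a short calculation. This sketch rests on several assumptions that the tutorial does not confirm: coordinates are half-open, the distance is measured from the midpoint of the region, and the TSS is the first base of the transcript on its own strand. The coordinates are made up.

```python
def overlap_length(a_start, a_stop, b_start, b_stop):
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_stop, b_stop) - max(a_start, b_start))


def annotate_region(region, gene):
    """Distance to TSS and overlap fractions for one region/gene pair.

    region: (start, stop); gene: (start, stop, strand) on the same chromosome.
    Positive distance means downstream of the TSS, negative means upstream.
    """
    r_start, r_stop = region
    g_start, g_stop, strand = gene
    shared = overlap_length(r_start, r_stop, g_start, g_stop)
    tss = g_start if strand == "+" else g_stop - 1
    midpoint = (r_start + r_stop) // 2
    distance = midpoint - tss if strand == "+" else tss - midpoint
    return {
        "distance_to_tss": distance,
        "overlap_with_gene": shared / (g_stop - g_start),
        "overlap_with_region": shared / (r_stop - r_start),
    }


# Made-up example: a 200 bp region lying entirely inside a 4 kb gene.
print(annotate_region((1200, 1400), (1000, 5000, "+")))
```

The example reproduces the transcription factor case from the note: the overlap with the region is 1 because the region lies inside the gene, while the overlap with the gene is small.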

Figure 21: Identifying closest genomic features to regions spreadsheet

Visualize Reads and Enriched Regions

You have gone through the steps for importing data and detecting TF-enriched regions and have identified potential binding sites within these enriched regions. You might explore the functions of the genes these binding sites regulate by using other Biological Interpretation tools such as GO Enrichment or Pathway Analysis, which are discussed in other tutorials.

This section explains how to view the ChIP-Seq data using the PGS Genome Viewer. For more information about the viewer, see NGS Chromosome Viewer.

Step 1 – Load the Data into the Viewer

 Select the parent spreadsheet (WoldChipSeqBamFiles) containing the list of samples in two rows: one for the chip sample and one for the mock sample

 Under Visualization on the ChIP-Seq Workflow, select Plot chromosome view.

The left-hand side contains a list of tracks that can be visualized. The tracks shown by default are (from the top) the transcript tracks, the sequence read visualization tracks, and the cytoband track (Figure 22).


Figure 22: Viewing the sample reads on chromosome 1

 To add tracks, select the New Track button on the left-hand side of the viewer. Choose Add tracks from a list of spreadsheets and select Next

 Add the p-value_filtered.txt track and the Motifs_instances.txt track by selecting the appropriate checkboxes. Uncheck Aligned Reads, as these tracks are already being displayed. Select Create

This will display the enriched regions found in the samples and the locations of the motif instances from the de novo motif discovery (additionally, you could display the regions that were found by searching the JASPAR database). If you have not gone through the steps for peak detection and motif discovery, these tracks will not be available. The viewer in Figure 23 will appear.


Figure 23: Adding p-value and motif binding site tracks

The two resulting tracks display the detected regions at each location on the chromosome for the NRSF-enriched sample (chip) and align them to the de novo discovered motif binding sites. Switching between chromosomes is possible by selecting a chromosome from the drop-down menu at the top of the window. The positions of the tracks can be changed by dragging the names of the tracks on the left-hand side of the viewer to the appropriate locations.

Step 2 – Explore the Data

Change the Genomic Scale via Zoom using the Mouse

Select the magnifying glass icon to zoom in on the data. You can zoom in several ways: (1) click and draw a box on the plot with the left mouse button, (2) use the mouse scroll wheel, (3) use the magnifying glass icons at the bottom of the screen, or (4) slide the bar between the icons. Figure 24 shows a zoomed-in view of one of the enriched regions.

Selecting the home icon at the bottom of the screen will reset the view to show the whole chromosome. Selecting the selection icon allows you to select a track and change the properties of that track.

Select the selection icon and then select the chip track (or select the Bam Profile (chip) track from the list of tracks in the left pane).

Under the Style tab, select Histogram, Alignments and select Color by Strands. Select Apply and the viewer in Figure 24 will appear.


The chip region indicates the location of the ChIP-Seq peak. Since only the ends of the resulting fragments from the ChIP assay are read, enriched regions will generally contain two peaks, one for the forward reads (shown in green) and one for the reverse reads (shown in red). The control (mock.bam) does not contain an enriched region at this site. The two motif binding site regions indicate that there are two potential binding sites in this region.

Figure 24: Viewing the zoomed-in view of an enriched region showing two possible binding sites at this location
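The two-peak pattern described above, with forward and reverse reads piling up on opposite sides of the binding site, can be mimicked with a small Python sketch. It is not how PGS detects peaks: the function name, the toy read positions, and the fixed read length are all hypothetical, and reads are assumed to be given as (leftmost start, strand) pairs with at least one read on each strand.

```python
from collections import Counter


def strand_summits(reads, read_length):
    """Return the most-covered position for forward and reverse reads.

    reads: iterable of (start, strand) pairs, strand being "+" or "-";
    each read is assumed to cover [start, start + read_length).
    Ties are resolved to the smallest position.
    """
    coverage = {"+": Counter(), "-": Counter()}
    for start, strand in reads:
        for pos in range(start, start + read_length):
            coverage[strand][pos] += 1
    summits = {}
    for strand, counts in coverage.items():
        best = max(counts.values())
        summits[strand] = min(p for p, c in counts.items() if c == best)
    return summits


# Toy data: forward reads pile up to the left, reverse reads to the right.
toy_reads = [(100, "+"), (105, "+"), (110, "+"),
             (250, "-"), (255, "-"), (260, "-")]
summits = strand_summits(toy_reads, read_length=25)
print(summits)                      # {'+': 110, '-': 260}
print(summits["-"] - summits["+"])  # 150
```

In this toy case the forward and reverse summits are 150 bp apart, and the protein-bound site would be expected to lie somewhere between them.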

Shortcut to Showing an Enriched Region

 To go to an enriched region from the p-value spreadsheet, right-click on the row header of the region of interest in the p-value_filtered spreadsheet and select Browse to Location; this action will automatically go to the coordinates of the region

 You can also type the name of a gene in the text box at the top of the viewer next to the magnifying glass, and the viewer will display the location of that gene. For example, typing NEUROD1 goes immediately to the NEUROD1 gene (Figure 25)


NEUROD1 contains a binding site for the NRSF motif. Notice that the enriched region for the NRSF transcription factor is within the NEUROD1 gene. As discussed in the Johnson et al. paper, NRSF is implicated in the repression of NEUROD1, but it was unknown exactly where the NRSF binding occurred. This data indicates that the binding site is within the NEUROD1 gene itself, as shown by the orange box in the Regions track.

Figure 25: Viewing the zoomed-in view of NEUROD1 gene

You may also save the reads shown in the visible genome browser window: in selection mode, right-click in the peak area and select Dump Displayed Reads to Spreadsheet.

Additional Analysis

In addition to the items covered in this tutorial, detecting SNPs in the ChIP-Seq sample is possible. You may look for differences in nucleotides across the samples or against a reference genome. This analysis is the same for all of the next generation sequencing workflows (ChIP-Seq, RNA-Seq, and DNA-Seq) and so is not covered in this tutorial.

Also, the ChIP-Seq results can be merged with gene expression data using the Genomic Integration step in the ChIP-Seq workflow.

End of Tutorial

For additional assistance, contact our technical support staff at +1-314-878-2329 or email support@partek.com.


References

Johnson, D. S., Mortazavi, A., Myers, R. M., & Wold, B. (2007). Genome-wide mapping of in vivo protein-DNA interactions. Science, 316.

Kharchenko, P. V., Tolstorukov, M. Y., & Park, P. J. (2008). Design and analysis of ChIP-Seq experiments for DNA-binding proteins. Nature Biotechnology, 26.

Kundaje, A. (2010). The phantom-peak coefficient as measure of *-seq data quality. Retrieved from ftp://encodeftp.cse.ucsc.edu/users/akundaje/phantomPeakQuality/ThePhantomPeakCoefficient.pdf

Neuwald, A. F., Liu, J. S., & Lawrence, C. E. (1995). Gibbs motif sampling: detection of outer membrane repeats. Protein Science, 4.

Tutorial last revised: Feb. 2012

Copyright  2012 by Partek Incorporated. All Rights Reserved. Reproduction of this material without express written consent from Partek Incorporated is strictly prohibited.
