Epigenetic Signatures for Undifferentiated Prostate Cancer Cells

1. BACKGROUND

Undifferentiated prostate cancer cells are a critical cellular source of castration-resistant prostate cancer (CRPC). Prostate cancer (PCa) is a heterogeneous disease that contains both differentiated and undifferentiated tumor cells. The undifferentiated PCa cells are pluripotent prostate cancer stem cells (PCSCs) that can generate differentiated PCa cells through asymmetric cell division (ACD) (Qin 2012). PCSCs may reside in a quiescent state with minimal cellular activity and are insensitive to the mainstay androgen-deprivation therapy (ADT) as well as standard-of-care chemotherapy; they therefore represent a critical cellular source of aggressive CRPC. CRPC typically develops following one to three years of ADT, and effective treatments are limited (Hotte 2010). With their intrinsic resistance to ADT, PCSCs are unique candidates for studying the pathological mechanisms of CRPC and discovering potential therapeutic targets. To fully understand the molecular mechanisms of CRPC, it is important to systematically detect aberrations in PCSCs and investigate how these aberrations give rise to tumorigenic properties and castration resistance.

Histone modifications are key epigenetic marks to characterize PCSCs. PCSC differentiation is an epigenetic process that normally does not involve genomic sequence alterations; thus epigenetic signatures can be used to effectively distinguish PCSCs from bulk tumors. Although the cancer stem cell hypothesis has been intensively explored in various cancer models, the global epigenetic landscape and the epigenetic mechanisms underlying PCSCs remain poorly understood. Post-translational modifications of histone proteins are a family of critical epigenetic marks that regulate gene transcription, chromatin remodeling and other fundamental cellular processes (Sawan 2010).
Aberrant histone modifications at gene promoter/enhancer regions may lead to androgen-mediated silencing of tumor suppressor genes or activation of proto-oncogenes (Chen 2010). Global histone modification levels have been reported to correlate significantly with prostate cancer status and can be used as predictors of clinical-pathological parameters, including relapse-free survival rate, preoperative prostate specific antigen (PSA), Gleason score and metastasis status (Seligson 2009; Bianco-Miotto 2010; Ellinger 2010). Individual studies have also linked H3K9me to the repression of PSA (Yamane 2006; Wissmann 2007). These results suggest that histone modification mediated epigenetic mechanisms are actively involved in prostate cancer development and progression.

Histone modifications mark extended promoter/enhancer domains in cancer cell subpopulations. One major epigenetic function of histone modifications is to mark active genomic regions for transcription factor binding, i.e. promoters and enhancers. The diverse transcription factor binding activities at active promoters/enhancers control cell-type-specific gene expression and are therefore particularly useful in characterizing cancer cell subpopulations. “Super enhancers” are a special category of extended genomic regions that harbor clusters of enhancers occupied by master transcription factors that regulate stem cell differentiation (Whyte 2013). Super enhancers are reported to be associated with critical oncogenic drivers in cancer cells (Loven 2013). Similarly, extended promoters, marked by elongated H3K4me3 peak signals, were observed in public ChIP-seq samples in our pilot study (Figure 3A). This “super promoter” pattern resembles “super enhancers” in size, transcription factor density, and the ability to distinguish cancer stem cells from bulk tumor cells.
Specific histone modification patterns associated with “super promoters” and “super enhancers” present a unique angle from which to explore key transcription factor regulation in PCSC differentiation, and provide critical insight into the epigenetic mechanisms of CRPC.

Our preliminary studies show PCSC specific histone modification patterns. Our previous study (Qin 2012) showed that PCSCs are enriched in prostate cancer cells with little or no prostate specific antigen (PSA) expression (i.e. PSA-/lo cells), whereas PCa cells with high PSA expression (i.e. PSA+ cells) are differentiated PCa cells and are more sensitive to ADT. PCSCs can then be purified through fluorescence-activated cell sorting of PCa cells infected with PSAP-GFP lentiviral reporters.

Figure 1 H3K4me3 ChIP-seq in LNCaP PCSC (red) / non-PCSC (blue). A) ChIP-seq peak overlap Venn diagram. B) Heatmap for specific H3K4me3 peaks in LNCaP PCSC/non-PCSC. C) Example of PCSC specific H3K4me3 peaks in the promoter of the short isoform of SOCS3.

To evaluate whether specific histone modifications can effectively distinguish PCSCs from bulk tumors, we carried out a pilot study to profile the active promoter mark H3K4me3 in PCSCs and non-PCSCs derived from LNCaP cells, through whole-genome Chromatin Immunoprecipitation sequencing (ChIP-seq). Although the majority of H3K4me3 peaks in these two cell populations show good overlap and reproducibility (Figure 1A), a subset of H3K4me3 peaks (Figure 1B) differs significantly between PCSCs and non-PCSCs. Consistent with the biological properties of PCSCs and non-PCSCs, the genes specifically associated with H3K4me3-occupied promoters in PCSCs are development-related genes (e.g., SOX11, DACH1, FOXD3, CXCL12) and stem cell (SC) markers (e.g., CD24, ALDH5A1), whereas genes in non-PCSCs are enriched for functions related to cell metabolism and AR signaling (e.g., KLK2, PSA, FKBP5).
Strikingly, several neuronal/neural development-related genes are also preferentially occupied by H3K4me3 in PSA-/lo PCa cells (e.g., NRXN1, BRSK1). SOCS3, a key inflammatory signal inhibitor that has been reported to correlate with aggressive prostate cancer progression (Puhr 2010, Pierconti 2011), also has a specific H3K4me3 peak in PCSCs on the promoter of its shorter isoform (Figure 1C). To fully understand the mechanisms underlying the histone modification specificity in PCSCs, we propose to define epigenetic signatures for PCSCs and test their functional relevance in CRPC by extending the preliminary study to a comprehensive investigation of multiple key histone modifications.

2. HYPOTHESIS / OBJECTIVE

We hypothesize that alterations in histone modification patterns contribute to the unique regulatory mechanisms of PCSCs, resulting in aggressive tumorigenesis and propagation, and leading to castration resistance. The objective is to identify epigenetic signatures specific to PCSCs, and ultimately to discover core signaling pathways, biomarkers and potential drug targets for CRPC.

3. SPECIFIC AIMS

3.1 Specific Aim 1: To discover PCSC specific combinatorial histone modification patterns

We plan to profile 5 key histone modifications of PCSCs derived from LNCaP cells and LAPC9 xenografts: H3K4me1, H3K4me3, H3K9me3, H3K27me3 and H3K36me3. We aim to identify differential patterns for each histone modification, combine them to discover PCSC specific combinatorial histone patterns, and explore the underlying mechanisms in CRPC.

3.2 Specific Aim 2: To discover PCSC specific super promoters (elongated H3K4me3 peak patterns)

We plan to explore a novel elongated peak pattern for H3K4me3 (i.e. “super promoters”) that can distinguish PCSCs from non-PCSCs. This peak pattern is conceptually similar to the “super enhancers” that have been shown to have master regulatory functions in maintaining the differentiation status of stem cells (Whyte 2013).
We aim to identify PCSC/non-PCSC specific “super promoters” and associated gene sets for functional analyses.

4. RESEARCH STRATEGY

4.1 Specific Aim 1: To discover PCSC specific combinatorial histone modification patterns

Rationale: Histone modifications vary greatly in genomic distribution, peak pattern and regulatory function (Strahl 2000). Accumulating studies have revealed highly coordinated interactions between multiple types of histone modifications that accomplish different regulatory functions in dynamic cellular environments (Wang 2008, Suganuma 2011, Linghu 2013). A specific histone modification pattern, termed "bivalent domains", consists of the active mark H3K4me3 and the repressive mark H3K27me3 co-localized on gene promoters (Bernstein 2006; Sanz 2008). Bivalent domains typically occur on developmental genes that are silenced in stem cells but poised to be activated at later developmental stages, and are therefore highly relevant in characterizing PCSC differentiation. More generally, the great diversity of histone modifications provides a variety of possible combinatorial patterns that may be used to discover the unique mechanisms of CRPC tumorigenesis and progression. However, a comprehensive investigation of these combinatorial histone modification patterns in PCSCs has not been done, and key epigenetic signatures remain undefined for PCSCs. We aim to bridge this knowledge gap by applying novel bioinformatics methodologies, and integrating existing data processing pipelines where appropriate, to systematically investigate combinatorial histone modification patterns for PCSCs, including both known combinatorial patterns, i.e. bivalent domains, and de novo combinatorial patterns. We propose to carry out correlation studies of these epigenetic signatures in clinical samples and explore their functional relevance in CRPC.
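As an illustration of the known combinatorial pattern described above, a bivalent domain can be flagged by testing whether a promoter is overlapped by both an H3K4me3 peak and an H3K27me3 peak. The Python sketch below is only a minimal example under stated assumptions: peaks and promoters are given as (chromosome, start, end) tuples with half-open coordinates, and the gene names, coordinates and function names are hypothetical rather than part of the proposed pipeline.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals share at least one base on the same chromosome."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bivalent_promoters(
    promoters: Dict[str, Interval],
    h3k4me3_peaks: List[Interval],
    h3k27me3_peaks: List[Interval],
) -> List[str]:
    """Return the promoters overlapped by both an active and a repressive peak."""
    hits = []
    for name, region in promoters.items():
        active = any(overlaps(region, p) for p in h3k4me3_peaks)
        repressed = any(overlaps(region, p) for p in h3k27me3_peaks)
        if active and repressed:
            hits.append(name)
    return sorted(hits)


# Toy data: hypothetical genes and coordinates, not taken from any real sample.
promoters = {
    "GENE_A": ("chr1", 1000, 3000),
    "GENE_B": ("chr1", 8000, 10000),
    "GENE_C": ("chr2", 500, 2500),
}
h3k4me3 = [("chr1", 1500, 2200), ("chr1", 8500, 9000)]
h3k27me3 = [("chr1", 2000, 4000), ("chr2", 1000, 2000)]

print(bivalent_promoters(promoters, h3k4me3, h3k27me3))
```

In this toy input only GENE_A carries both marks; GENE_B has the active mark alone and GENE_C the repressive mark alone. A genome-scale analysis would likely replace the nested scan with a sorted or indexed interval structure.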
Experimental Design: We plan to perform ChIP-seq profiling for both LNCaP cells and LAPC9 xenografts for 5 key histone modifications: H3K4me1, H3K4me3, H3K9me3, H3K27me3 and H3K36me3 (Table 1). H3K4me3 and H3K27me3 are active and repressive promoter marks respectively, and co-localization of H3K4me3 and H3K27me3 marks the bivalent domain promoters that could poise genes for activation upon stem cell differentiation. H3K4me1 is an active enhancer mark. Enhancer activity is highly dynamic and transient during PCSC differentiation, and marks cell lineage specific regulation. H3K9me3 is a repressive mark for heterochromatin domains and has been linked to the down-regulation of PSA (Yamane 2006; Wissmann 2007). H3K36me3 is a typical gene body mark that correlates positively with actively transcribed genes. Together these 5 histone modifications cover active/repressive marks in the majority of regulatory regions in the genome, and represent a collection of key histone modifications with various regulatory functions. For each histone modification, we will perform ChIP-seq profiling with 2 biological replicates each for PCSCs and non-PCSCs. Two replicates of input controls will also be included.

Table 1 Proposed Histone Modification ChIP-seq.

Mark type              Activation    Repression
Promoter Mark          H3K4me3       H3K27me3
Enhancer Mark          H3K4me1
Gene Body Mark         H3K36me3
Heterochromatin Mark                 H3K9me3

Prostate cancer cells will be infected with the PSAP-GFP lentiviral reporter and incubated. For the LNCaP system, the infected cells will be incubated for 72 hours. For the LAPC9 xenograft, the infected cells will be incubated for ~18 hours and injected into NOD/SCID mice to establish reporter tumors. PSA-/lo and PSA+ isogenic subpopulations will be purified through fluorescence-activated cell sorting (FACS). The top 10% GFP-bright (i.e., PSA+) cells and the bottom 2-6% GFP-negative cells (i.e., PSA-/lo) will be selected.
We have tested and validated a novel library preparation method, ThruPLEX, which prepares picogram amounts of DNA directly for Illumina next generation sequencing. Using this method, we can determine histone modification patterns in a very limited number of PCSCs (e.g. 5000).

Data Preprocessing: Raw ChIP-seq reads will be mapped to the reference genome using the short read mapping software BOWTIE (Langmead 2009); only uniquely mapped reads will be retained. We will use MACS (Zhang 2008) to generate the whole genome ChIP-seq signals. The ChIP-seq profiles will be normalized to a total read number of 10 million per sample. We will build a UCSC track hub to integrate all histone modification ChIP-seq datasets for flexible visualization. Processed ChIP-seq data will be compiled into standard WIGGLE format for the UCSC genome browser (http://genome.ucsc.edu/) (Kent 2002). Raw sequencing data and processed ChIP-seq WIGGLE files will be deposited in the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/).

Identification of Combinatorial Histone Modification Patterns: We will develop a supervised machine learning approach to discriminate the histone modification profiles of PCSCs and non-PCSCs, and to identify significant differentially marked regions. A classification model based on recursive feature elimination support vector machines (RFE-SVM) will be built on the preprocessed histone modification ChIP-seq signals by assigning different weights to signals to maximize the separation between PCSCs and non-PCSCs. This model adjusts itself to fit the data by recursively removing the least significant signal (feature) until the prediction error rate falls below the termination threshold. The remaining features will be sorted by absolute weight to generate a ranked list of differential peaks (Figure 2A).

Figure 2 Histone modification ChIP-seq differential analysis pipeline. A) ChIP-seq signal extraction and differential analysis between PCSC/Non-PCSC.
B) Combinatorial pattern discovery for multiple histone modifications.

Different histone modification marks localize in different functional regions and may not overlap with each other. We will associate peaks with their target genes and combine the differential results for each individual histone mark at the gene level. For each histone mark, the differential status of a given gene is represented by a differential score D, which is defined as a non-linear transformation of the differential p-value using the following logistic function (Figure 2B):

$$ D = \frac{1 - p/a}{1 + p/a} $$

where p is the signed p-value, with a positive value for enrichment and a negative value for depletion, and a represents the significance level of the differential test, typically a = 0.01. The differential score is bounded between -1 and +1, which represent PCSC specific and non-PCSC specific signals respectively, and therefore allows the differential results of multiple histone modifications to be compared for combinatorial pattern discovery. We will use the ChromHMM software to identify common histone mark patterns (Kellis 2012). Genes with similar differential statuses will be grouped together to define specific gene set signatures for PCSCs and non-PCSCs (Figure 2D). These gene set signatures will be tested for functional significance, including gene ontology (GO) enrichment, pathway analysis and other correlations through gene set enrichment analysis (GSEA) (Subramanian 2005). Our lab has access to commercial software applications that provide extensive analysis based on large scale literature and database mining, including Ingenuity Pathway Analysis (IPA) and Oncomine (Rhodes 2007), a cancer-oriented application. To explore the clinical relevance, the combinatorial histone modification patterns and gene set signatures can also be integrated with proteomic data and various tumor characteristics of the clinical samples in the TCGA database (The Cancer Genome Atlas, http://cancergenome.nih.gov/).
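The gene-level differential score and the grouping of genes by similar differential status can be sketched in Python as follows. This is one possible reading only: it assumes the score has the form D = (1 - p/a) / (1 + p/a), applies it to the unsigned p-value (how the sign of p enters the score is not modelled here), and groups genes by the set of marks with D > 0, i.e. p < a. The function names, gene names and p-values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def differential_score(p: float, a: float = 0.01) -> float:
    """Map an unsigned p-value in [0, 1] to a score in (-1, 1].

    Assumed form: D = (1 - p/a) / (1 + p/a), so D = 1 at p = 0,
    D = 0 at p = a, and D approaches -1 as p grows far above a.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    ratio = p / a
    return (1.0 - ratio) / (1.0 + ratio)


def group_by_pattern(
    gene_pvalues: Dict[str, Dict[str, float]], a: float = 0.01
) -> Dict[Tuple[str, ...], List[str]]:
    """Group genes by the set of histone marks whose score is positive."""
    groups = defaultdict(list)
    for gene, marks in gene_pvalues.items():
        pattern = tuple(
            sorted(m for m, p in marks.items() if differential_score(p, a) > 0)
        )
        groups[pattern].append(gene)
    return dict(groups)


# Toy input: hypothetical genes and p-values for two marks.
gene_pvalues = {
    "GENE_A": {"H3K4me3": 0.001, "H3K27me3": 0.5},
    "GENE_B": {"H3K4me3": 0.002, "H3K27me3": 0.8},
    "GENE_C": {"H3K4me3": 0.3, "H3K27me3": 0.0001},
}

for pattern, genes in group_by_pattern(gene_pvalues).items():
    print(pattern, genes)
```

Because every mark is mapped onto the same bounded scale, scores from different histone modifications can be compared directly; a tool such as ChromHMM would replace the simple sign-based grouping used here.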
We plan to perform meta-analyses to combine these analysis results and generate a comprehensive framework connecting molecular signatures to the underlying mechanisms and finally to the clinical outcomes.

4.1.5 Expected Outcome and Potential Problems

We expect to identify PCSC specific signatures for combinatorial histone modification patterns, as well as PCSC specific signatures for each individual histone modification. We expect to generate a list of gene signatures defined by the specific histone modification patterns for experimental validation with our bench collaborator. Through correlation studies with existing clinical data, we may further discover novel biomarkers and drug targets for PCSCs. One potential technical problem is that the RFE-SVM approach used in the histone modification differential analysis is a highly computationally intensive algorithm, and may be a computational bottleneck in the analyses. To address this issue, we will use the BlueBioU supercomputer at Rice University as an extra computational resource. Although our goal is to develop an automatic data analysis pipeline for large scale whole genome profile comparison, computer programs may not be optimal in handling certain biological scenarios. Manual curation is necessary in some cases, especially in the stages of target gene discovery and functional analysis. We expect to adjust our data processing pipeline according to the feedback of bench collaborators and iteratively improve the analysis quality.

4.2 Specific Aim 2: To discover PCSC specific super promoters (elongated H3K4me3)

Rationale: As one of the most widely studied histone marks, H3K4me3 is an active promoter mark whose peak intensities correlate positively with gene transcription (Bernstein 2005, Heintzman 2007).
In contrast to the typical sharp and narrow H3K4me3 peak pattern located in proximal promoters, we observed an elongated H3K4me3 peak pattern that spreads into the gene body for several hundred to several thousand base pairs (Figure 3). These unusual H3K4me3 peak patterns mark significantly larger active promoter regions that may be able to recruit more transcription factors and more complex transcription machinery for dynamic regulation. This is conceptually similar to the “super enhancers” that mark active enhancer domains for master transcription factors that regulate embryonic stem cell differentiation (Whyte 2013); hence we term them “super promoters”. We hypothesize that mechanisms similar to those of “super enhancers” also exist in “super promoter” patterns marked by extensive H3K4me3 signal domains. The master regulators of “super promoters” may possess critical functions in maintaining differentiation status in PCSCs, and therefore provide potential targets for interfering with CRPC development. “Super promoter” patterns are commonly neglected in existing cancer histone modification studies, which have focused on the heights of H3K4me3 peaks. Through large scale data mining in the public ENCODE ChIP-seq dataset (Raney 2011; Rosenbloom 2012), we observed that “super promoters” exist widely and in a cell population specific manner (Figure 3A), suggesting potential relationships with cancer status. In our preliminary H3K4me3 ChIP-seq data for LNCaP cells, we also found evidence of “super promoter” patterns distinguishing PCSCs from non-PCSCs (Figure 3B).

Figure 3 Elongated H3K4me3 peak (“super promoter”) pattern detected in A) cancer (red) vs. normal (blue) in the ENCODE database. B) PCSCs (red) vs. non-PCSCs (blue) in LNCaP cells.
TMEFF2, an androgen-regulated gene with anti-proliferative effects in prostate cancer cells (Gery 2002), displays an elongated H3K4me3 peak in non-PCSC cells, suggesting that such a peak pattern may be highly relevant in revealing the connections between prostate cancer stem cell differentiation/proliferation and androgen dependence.

Data Preprocessing: Read mapping, whole genome profile generation and visualization are the same as in Aim 1. To facilitate differential peak width detection, we propose to use a modified version of an approach developed for nucleosome positioning data processing (Chen 2012) to extract histone mark signals, based on the rationale that histone modification signals follow the spatial occupancy of nucleosomes. We consider three basic peak transition patterns in signal extraction (Figure 4A): 1) Peak intensity change, which reflects the difference in histone modification levels within the same genomic regions. This is the most studied case in ChIP-seq data analysis. 2) Peak location shift, typically caused by nucleosome displacement. The significance of a location shift in histone modification signals depends on the extent of nucleosome displacement and needs to be handled accordingly in the signal extraction procedure. 3) Peak size change, which is commonly observed in broad domain histone modification patterns, such as H3K36me3 and H3K9me3. In our case, it is also critical to accurately extract the peak width information in H3K4me3 data for identification of “super promoters” with extended peak sizes. To allow effective comparisons across different transition patterns, we will develop an adaptive binning method that converts the complicated histone modification peak patterns of the latter two cases into simple peak height changes.
We define data bins with respect to these three transition patterns and extract ChIP-seq intensities (Figure 4A): 1) For peaks with occupancy change, we will use the consensus peak boundary to define the data bin; 2) For peaks with location change, we will either assign peaks to separate bins if they are far apart, or place them in one data bin if they are close, according to a distance threshold measured by the “fuzziness” of histone modification peaks using a statistical model derived to detect nucleosome position deviations (Jiang and Pugh 2009; Chen 2012); 3) For peaks with size expansion, we will split the wide peak into multiple bins at the boundaries of the narrow peak.

Figure 4 A) ChIP-seq signal extraction by adaptive binning methods. B) “Super Promoter” discovery.

Identification of elongated H3K4me3 peaks: The preprocessed signals are extracted at aligned boundaries across samples, which allows sensitive comparisons between samples. Through this process we convert the “wide vs. narrow” or “still vs. moved” peak comparisons into the classic “present vs. absent” comparisons. Peak width changes can be detected as differential signals adjacent to non-differential signals from proximal genomic regions. Based on this principle, we will develop a signal processing pipeline to reconstruct wide peaks from divided signals with differential status and detect significant peak width changes as “super promoters” (Figure 4B). “Super promoters” represent a unique set of genes that may possess more sensitive and flexible responses to cellular environments, including PCSC differentiation and androgen receptor signaling. We plan to perform functional analyses similar to those in Aim 1 on super promoter genes, including gene ontology (GO) enrichment, pathway analysis, gene set enrichment analysis (GSEA) (Subramanian 2005), Ingenuity Pathway Analysis (IPA) and Oncomine (Rhodes 2007). Active promoters often harbor binding motifs that can be used to identify upstream transcription factors.
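The size-expansion case of the adaptive binning idea, and the rule that a width change appears as a differential bin adjacent to a non-differential bin, can be sketched in Python as follows. This is a simplified illustration, not the proposed pipeline: each sample contributes a single peak given as a (start, end) pair, bin status is decided by presence or absence rather than by a statistical test on intensities, and the function names and the 1000 bp threshold are hypothetical.

```python
from typing import List, Tuple

Peak = Tuple[int, int]  # (start, end) on one chromosome, half-open


def make_bins(peak_a: Peak, peak_b: Peak) -> List[Peak]:
    """Cut the region spanned by both peaks at every peak boundary."""
    cuts = sorted({peak_a[0], peak_a[1], peak_b[0], peak_b[1]})
    return list(zip(cuts, cuts[1:]))


def inside(bin_: Peak, peak: Peak) -> bool:
    """True when the bin lies entirely within the peak."""
    return peak[0] <= bin_[0] and bin_[1] <= peak[1]


def bin_status(bin_: Peak, peak_a: Peak, peak_b: Peak) -> str:
    """Label a bin as shared, specific to one sample, or empty."""
    in_a, in_b = inside(bin_, peak_a), inside(bin_, peak_b)
    if in_a and in_b:
        return "both"
    if in_a:
        return "a_only"
    if in_b:
        return "b_only"
    return "none"


def elongation(peak_a: Peak, peak_b: Peak) -> int:
    """Bases by which peak_a extends past peak_b.

    Only a_only bins that touch a shared bin are counted, so a peak that
    merely moved away (no shared bin) is not reported as a width change.
    """
    bins = make_bins(peak_a, peak_b)
    status = [bin_status(b, peak_a, peak_b) for b in bins]
    extra = 0
    for i, (start, end) in enumerate(bins):
        if status[i] != "a_only":
            continue
        neighbours = status[max(i - 1, 0):i] + status[i + 1:i + 2]
        if "both" in neighbours:
            extra += end - start
    return extra


def is_super_promoter(peak_a: Peak, peak_b: Peak, min_extension: int = 1000) -> bool:
    """Call an elongated peak when the extension reaches a hypothetical cutoff."""
    return elongation(peak_a, peak_b) >= min_extension


# Toy example: a wide peak in one sample, a narrow peak at the same promoter in the other.
wide, narrow = (1000, 5200), (1000, 1800)
print(make_bins(wide, narrow))
print(elongation(wide, narrow), is_super_promoter(wide, narrow))
```

Splitting at the narrow peak's boundaries turns the "wide vs. narrow" comparison into per-bin "present vs. absent" calls, which is the conversion described above; two disjoint peaks yield no shared bin and are therefore treated as a location shift rather than an elongation.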
We will perform sequence analyses to discover enriched motifs in the “super promoters”, including both known and de novo motifs. Based on the detected motifs, we may discover master upstream regulators of “super promoter” genes and further reveal the regulatory network of “super promoters”. We plan to perform meta-analysis to study the correlation of “super promoter” genes with various tumor characteristics in TCGA clinical samples. We aim to combine these results to generate a comprehensive network of the two cell subpopulations, discover novel molecular mechanisms that maintain the differentiation status of PCSCs, and better define CRPC aggressiveness.

Expected Outcome and Potential Problems: Using the customized differential peak width pipeline, we expect to identify “super promoter” patterns in both PCSCs and non-PCSCs. We will define the gene signatures associated with the “super promoters” in PCSCs and non-PCSCs. We expect to discover regulatory mechanisms in “super promoter” genes similar to those of “super enhancers”, such as master regulatory transcription factors in PCSC differentiation. Through meta-analyses in clinical samples, we expect to establish potential connections between “super promoter” genes and clinical classifications of CRPC. One potential problem is that elongated H3K4me3 peak patterns may result from alternative promoters of multiple isoforms. To accurately distinguish alternative promoters from a single elongated H3K4me3 peak domain, we will collaborate with Dr. Tang’s group to validate the detected “super promoters”. If the “super promoter” patterns cover significant alternative promoters in the gene annotation, we may perform either EST sequencing or whole genome RNA sequencing to identify the transcripts of the genes of interest.

5. COLLABORATION

We are collaborating with Dr.
Dean Tang from MD Anderson Cancer Center, who has extensive experience in prostate cancer research and will provide suggestions from biological and disease perspectives. Dr. Tang's lab routinely uses both hormone-naive and hormone-refractory prostate tumor samples, and has utilized 162 primary untreated patient prostate tumors ranging from Gleason Grade 6 to 9/10. These samples have been employed not only in efforts to establish 'primary' xenograft tumors but also in preparing single cell fractions for biological studies. Dr. Tang's lab has been collaborating with Dr. Chris Logothetis's group at GU Medical Oncology of MD Anderson Cancer Center, using bone marrow (BM) aspirates from prostate cancer patients to study the relationship between PCSCs and metastasis. The Human Genome Sequencing Center (HGSC) of Baylor College of Medicine (BCM), which is one of three large-scale sequencing centers funded by the National Institutes of Health, will provide sequencing services for the histone modification ChIP-seq. The Dan L. Duncan Cancer Center (DLDCC) at Baylor College of Medicine will provide a high performance computer cluster, data storage, and software maintenance.

6. OVERARCHING CHALLENGES AND FOCUS AREAS

This proposal aims to address the overarching challenge of developing effective treatments and understanding mechanisms of resistance for men with high-risk or metastatic prostate cancer, i.e. CRPC. With their intrinsic capability for tumorigenesis and insensitivity to ADT, PCSCs are a driving force of CRPC development and therefore present a key therapeutic target. However, PCSCs make up less than 1% of the prostate cancer cell population (Qin 2012), and their molecular signatures are greatly diluted in the tumor environment and have not been subject to comprehensive characterization.
A systematic investigation of key epigenetic signatures in PCSCs will greatly increase current understanding of the regulatory mechanisms of cancer stem cells in CRPC tumor progression and may lead to the identification of novel biomarkers and drug targets for CRPC. In Specific Aim 1, we expect to define combinatorial histone modification patterns in PCSCs and discover target genes as well as core signaling pathways leading to androgen independence or alternative mediation in CRPC. In Specific Aim 2, we will explore a novel H3K4me3 modification pattern that suggests potential extra transcriptional activity specific to cancer status, which may reveal novel regulatory co-factors related to castration resistance.

We propose to carry out comprehensive bioinformatics studies in two focus areas of the prostate cancer research program (PCRP). 1) Biomarker discovery. We aim to identify epigenetic signatures in PCSCs as biomarkers with better specificity for classifying cancer stem cells in CRPC. 2) Resistance mechanisms. We aim to reveal the underlying mechanisms of androgen independence or alternative mediation through PCSC proliferation and differentiation in CRPC.

This proposal emphasizes the bioinformatics analysis of large-scale histone modification ChIP-seq datasets. We propose to develop novel data processing methodologies and integrate existing approaches where appropriate, and ultimately to provide comprehensive bioinformatics solutions for CRPC biomarker discovery through epigenetic signatures. We plan to develop an integrated bioinformatics package for combinatorial histone modification profiling, which can be used not only in the classification of PCSCs/non-PCSCs, but also in more general comparative epigenetic studies of other prostate cancer cell subpopulations. We expect to release this data analysis package as open source software to the research community, along with the publication of the specific results on epigenetic signatures of CRPC from PCSCs.