In silico study of cancer-related genes and microRNAs
(Using microarrays to screen for cancer genes and to investigate their upstream regulatory microRNAs)

Ka-Lok Ng (吳家樂)
Department of Biomedical Informatics
Asia University

Contents

Motivation
- Predict cancer genes based on microarray mRNA expression levels
- microRNA (miRNA) can act as an oncogene (OCG) or tumor suppressor gene (TSG)
- Identify cancer-related miRNAs, their target genes, and downstream protein-protein interactions (prediction of novel cancerous proteins)

(1) Introduction – microarray, cancer, microRNA
(2) Methods – input data
(3) Results
    (a) cancer gene prediction (Bioconductor), i.e. prostate/breast cancer
    (b) correlation study of miRNA and mRNA expression levels
    (c) ncRNAppi – a platform for studying microRNAs and their target genes' protein-protein interactions
(4) Summary

Central dogma of molecular biology
Post-transcriptional regulation – microRNA targets mRNA (transcriptome)

Introduction
Types of RNAs

- mRNA
- rRNA: Ribosomal RNA. Participates in protein synthesis.
- tRNA: Transfer RNA. Interface between mRNA and amino acids.
- ncRNA: Non-coding RNA. Transcribed RNA with a structural, functional or catalytic role.
- snRNA: Small nuclear RNA. Includes RNA that forms part of the spliceosome.
- snoRNA: Small nucleolar RNA. Found in the nucleolus, involved in modification of rRNA.
- miRNA: Micro RNA. Small RNA involved in regulation of expression.
- stRNA: Small temporal RNA. RNA with a role in developmental timing.
- siRNA: Small interfering RNA. Active molecules in RNA interference.
- Other: Including large RNA with roles in chromatin structure and imprinting.

Cancer formation, and a summary of the ten leading causes of cancer death in Taiwan in 2008 (ROC year 97)

Rank   Cancer type                 Deaths    Percentage
-      Total                       38,913    100%
1      Lung Cancer                  7,777    20.0%
2      Hepatocellular Carcinoma     7,651    19.7%
3      Colorectal Cancer            4,266    11.0%
4      Female Breast Cancer         1,541     4.0%
5      Gastric Cancer               2,292     5.9%
6      Oral Cavity Cancer           2,218     5.7%
7      Prostate Cancer                892     2.3%
8      Cervical Cancer                710     1.8%
9      Esophageal Cancer            1,433     3.7%
10     Pancreatic Cancer            1,364     3.5%

Microarray – overview

[Figure: probes (genes) on the array; targets are cDNA labeled with Cy5 (red) and cDNA labeled with Cy3 (green). By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark]

cDNA microarrays

- Microarrays are used to measure gene expression levels in two different conditions.
- A green label is used for the control sample and a red one for the experimental sample.
- Hybridization is DNA-cDNA or DNA-mRNA.
- The hybridized microarray is excited by a laser and scanned at the appropriate wavelengths for the red and green dyes.
- The amount of fluorescence emitted (intensity) upon laser excitation is approximately proportional to the amount of mRNA bound to each spot.
- If the transcript is more abundant in the control (experimental) sample, the spot appears green (red), which indicates the relative amount of transcript for that mRNA (EST) in the two samples.
- If the transcript is equally abundant in both samples, the spot appears yellow.
- If it is present in neither sample, the spot appears black.

Microarray data generation, processing and analysis

Image analysis
- Image quantitation – locating the spots and measuring their fluorescence intensities

Information processing
- Data normalization and integration – construction of the gene expression matrix
- Gene expression data analysis and mining – finding differentially expressed genes (DEGs) or clusters of similarly expressed genes
- These analyses generate new hypotheses about the underlying biological processes, which in turn should be tested in follow-up experiments

[Figure: data analysis from sets of spots, clustering. Source: http://www.mathworks.com/company/pressroom/image_library/biotech.html]

Introduction – biogenesis of microRNA

miRNA gene
-> pri-miRNA (stem-loop structure), processed by Drosha
-> pre-miRNA (65~90 nt), carried by Exportin 5 to the cytoplasm
-> mature miRNA (20~25 nt), generated by the RNase III type enzyme Dicer
-> directed by RISC to the target mRNA, which is cleaved or whose translation into protein is impeded

Introduction – miRNAs can play the role of an OCG or a TSG

- When a miRNA plays an oncogenic role, it targets TSGs or genes that control cell differentiation or apoptosis, and this leads to tumor formation.
- When a miRNA plays a tumor suppressor role, it targets OCGs or genes that control cell differentiation or apoptosis, so it can suppress tumor formation.
- We therefore expect a negative correlation between miRNA and mRNA expression profiles.
- We integrate human miRNA-targeted (or siRNA-targeted) mRNA data, protein-protein interaction (PPI) records, and tissue, pathway and disease information to establish a disease-related miRNA (or siRNA) pathway database.

Introduction – cancer-related miRNAs

Cancer-related miRNA                                                              Cancer type     References
miR-17-92 cluster, let-7                                                          Lung cancer     Martin et al., 2006; Yanaihara et al., 2006; Takamizawa et al., 2004
miR-10b, miR-21, miR-125b, miR-145, miR-155                                       Breast cancer   Iorio et al., 2005; Si et al., 2007
miR-18, miR-122a, miR-224, miR-199a, miR-199a*, miR-195, miR-125a, miR-200a       Liver cancer    Murakami et al., 2006; Meng et al., 2007; Gramantieri et al., 2007
miR-15, miR-16                                                                    B-CLL           Calin et al., 2004; Calin et al., 2002

A platform for studying miRNAs and cancerous target genes

Inputs
- TarBase data: experimentally verified miRNA-mRNA pairs
- NCI-60 cancer data: expression profiles of miRNA and mRNA

Output
- miRNA-mRNA anti-correlation pairs

miRNA annotation
- miR2Disease – disease-related miRNAs
- Chromosomal fragile sites
- miRNA cluster information
- CpG island proximal miRNAs

mRNA annotation
- TAG – known OCG, TSG or CRG
- OMIM – disease genes
- KEGG – cancer pathways

Number of cell lines for the nine cancer types in the NCI-60 data sets

Cancer type   No. of cell lines
Breast         5
CNS            6
Colon          7
Lung           9
Leukemia       6
Melanoma      10
Ovarian        7
Prostate       2
Renal          8

miRNA, target gene, protein-protein interaction (PPI)

[Figure: miRNA or siRNA -> protein TG (mRNA is suppressed) -> protein L1 (TF) -> protein L2. TG, L1 and L2 carry x, y and z BP/MF annotations; the BP/MF overlaps are n1 for TG – L1 and n2 for L1 – L2.]

- Tissue-specific miRNA or siRNA target, and its PPI partners up to the second level.
- If the upstream miRNA (or siRNA) is defective, its effect could be amplified downstream.
- As an illustration, given that a miRNA (or siRNA) targets gene TG, which has two successive PPI partners, i.e.
proteins L1 and L2; and suppose that genes TG and L2 are involved in the same disease, then it is highly probable that gene L1 is also related to the same disease. This is quantified by enrichment analysis.

Input data and Methods

Databases:
- ArrayExpress – raw data files (Affymetrix U95Av2) for 64 prostate cancer tissue and 18 normal prostate tissue samples
- TAG (Tumor Associated Gene)
- NCI-60 – miRNA and mRNA gene expression profiles for 9 cancer types
- TarBase – miRNA targets (experimentally verified)
- miR2Disease – a comprehensive resource of miRNA deregulation in various human diseases
- OMIM – human disease information
- KEGG – cancer pathway information
- ncRNAppi – a useful tool for identifying ncRNA target pathways
- PPI data (BioIR) – seven databases are integrated: HPRD, DIP, BIND, IntAct, MIPS, MINT and BioGRID
- Gene Ontology (GO) – Biological Process and Molecular Function annotations

Tool: Bioconductor

Research Protocol

Predict DEGs using R and Bioconductor commands

Enter the following commands in the R environment:

    library("affy")
    library("limma")
    library("XML")
    library("annotate")
    library("hgu95av2.db")
    library("annaffy")

    # Read all CEL files in the working directory and normalize with RMA
    eset <- justRMA()
    annotation(eset)   # reports the chip type of the data set

    # Design matrix: assumes the first 18 arrays are normal and the next 64 are tumor (DM)
    design <- cbind(normal = c(rep(1, 18), rep(0, 64)),
                    DM     = c(rep(0, 18), rep(1, 64)))
    fit <- lmFit(eset, design)
    cont.matrix <- makeContrasts(DMvsNo = DM - normal, levels = design)
    fit2 <- contrasts.fit(fit, cont.matrix)
    fit2 <- eBayes(fit2)

    # Top 100 DEGs with Benjamini-Hochberg adjusted p-values (computed once, reused below)
    top <- topTable(fit2, number = 100, adjust.method = "BH")
    top
    # Older limma releases return probe IDs in an ID column, newer ones as row names
    genenames <- if (!is.null(top$ID)) as.character(top$ID) else rownames(top)
    adj.P_Val <- signif(top$adj.P.Val, digits = 3)
    logFC     <- signif(top$logFC, digits = 3)

    # Annotate the probes and write an HTML report
    absts <- pm.getabst(genenames, "hgu95av2.db")
    atab <- aafTableAnn(genenames, "hgu95av2.db", aaf.handler())
    stattable <- aafTable("logFC" = logFC, "adj_P.Val" = adj.P_Val)
    table <- merge(atab, stattable)
    saveHTML(table, file = "report.html", title = "Significant gene list and its annotation information")

Results – DEGs predicted by Bioconductor

- The result: the top 100 DEGs (either up- or down-regulated).
- After eliminating duplicated genes, the predicted total number of DEGs is 85, and the adjusted p-values of all DEGs are less than 1.9 × 10^-5.
- TAG ∩ DEGs: 14 known cancer genes among the 85 predicted DEGs (16.5%).

Results – miRNAs, DEGs and cancer types

[Figure: miRNAs, DEGs and cancer types; legend: Other DEGs]

Results – The relationship among miR-20a, TGFBR2 and human prostate cancer

16461460
http://ppi.bioinfo.asia.edu.tw/R_cancer/

A platform for studying miRNAs and cancerous target genes

For a given cancer tissue type, we calculated both the Pearson correlation coefficient (PCC) and the Spearman rank correlation (SRC) between the expression profiles of a miRNA and its target gene. The PCC, r, is given by

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

where x_i and y_i denote the expression intensity of the miRNA and of the miRNA's target gene respectively, and n is the number of samples.

- One problem with quantifying the strength of correlation by PCC is that it is easily skewed by outliers. A single outlying data point can make two genes appear to be correlated even when all the other data points are not.
- SRC is a non-parametric statistical method that is robust to outliers.

The PCC and SRC are calculated for:
- Three Affymetrix chips: U95(A-E), U133A, U133B
- Normalization methods: GCRMA, MAS5, RMA

Test of hypothesis of PCC and SRC

- The Pearson product-moment table is used to test the significance of a PCC result.
- The hypothesis being tested is a one-tailed test.
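The one-tailed test of a negative correlation can be sketched in R, the language already used in this protocol. This is a minimal illustration, not the platform's actual code: the expression values are made up, the variable names are hypothetical, and `cor.test` computes a p-value directly instead of looking up a critical-value table.

```r
# Hypothetical expression values across six cell lines (illustrative only,
# not taken from the NCI-60 data). The last miRNA value acts as an outlier.
mirna  <- c(2.1, 3.4, 1.2, 4.8, 3.9, 9.5)
target <- c(7.9, 6.1, 8.8, 5.2, 5.9, 5.0)

# PCC written out from its definition, to compare with cor()
dx <- mirna - mean(mirna)
dy <- target - mean(target)
pcc_manual <- sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))

pcc <- cor(mirna, target, method = "pearson")
src <- cor(mirna, target, method = "spearman")

# One-tailed tests of the alternative hypothesis "correlation < 0"
pcc_test <- cor.test(mirna, target, method = "pearson",  alternative = "less")
src_test <- cor.test(mirna, target, method = "spearman", alternative = "less")

cat("PCC =", round(pcc, 3), " one-tailed p =", signif(pcc_test$p.value, 3), "\n")
cat("SRC =", round(src, 3), " one-tailed p =", signif(src_test$p.value, 3), "\n")
```

In this made-up data the ranks are perfectly reversed, so SRC is -1, while the single extreme miRNA value makes PCC weaker in magnitude. That difference is the reason both measures are reported side by side.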
- A different test is applied for the SRC results.

[Table: critical values for the one-tailed test using Pearson and Spearman correlation at significance levels α = 0.05 and α = 0.10]

Results – hsa-miR-1:AXL, PCC and SRC calculations

Cases where both PCC and SRC are less than or equal to -0.5.

Results – hsa-miR-10b:HOXD10

Other examples:
- hsa-miR-21:PTEN (TSG)
- hsa-miR-15b:BCL2 (TSG)
- hsa-miR-16:BCL2 (TSG)

miR2Disease: diseases associated with hsa-mir-10b, i.e. leukemia, breast, colon, ovarian and prostate cancers.

Extension – work in progress

- Validate how good the correlation-based prediction is.
- Add further information:
  - CpG islands: miRNAs located around CpG islands (i.e., miR-34b, miR-137, miR-193a, and miR-203) are silenced by DNA hypermethylation in oral cancer
  - miRNA clusters, fragile sites
- Positively correlated miRNA:mRNA pairs may involve TFs.

ncRNAppi – miRNA, target genes, PPI, and the protocol of enrichment analysis

- Two directly interacting proteins tend to participate in the same biological process or share the same molecular function.
- Let a miRNA targeting pathway be denoted by miRNA – TG – L1 – L2.
- We propose to rank the pathways according to the number of overlapping biological processes (or molecular functions) between TG and L1, and between L1 and L2.
- The Jaccard coefficient (JC) is used to rank the significance of a pathway. The JC of sets A and B is defined by

\[ JC(A, B) = \frac{|A \cap B|}{|A \cup B|} \]

where |A ∩ B| and |A ∪ B| denote the cardinality of A ∩ B and A ∪ B respectively.
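As a small worked example of the Jaccard coefficient for the two segments TG – L1 and L1 – L2, the following R sketch uses hypothetical annotation sets. The helper name `jaccard` and the assignment of GO identifiers to TG, L1 and L2 are assumptions made for illustration only; they are not ncRNAppi records.

```r
# Jaccard coefficient of two sets of GO term identifiers
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  u <- union(a, b)
  if (length(u) == 0) return(NA_real_)   # undefined when both sets are empty
  length(intersect(a, b)) / length(u)
}

# Hypothetical biological process (BP) annotation sets for TG, L1 and L2
bp_TG <- c("GO:0006915", "GO:0008283", "GO:0007165", "GO:0006355")
bp_L1 <- c("GO:0006915", "GO:0007165", "GO:0006468")
bp_L2 <- c("GO:0006468", "GO:0007165", "GO:0016310", "GO:0006915")

jc_TG_L1 <- jaccard(bp_TG, bp_L1)   # 2 shared terms out of 5 distinct terms
jc_L1_L2 <- jaccard(bp_L1, bp_L2)   # 3 shared terms out of 4 distinct terms

cat("JC(TG,L1) =", jc_TG_L1, "\n")
cat("JC(L1,L2) =", jc_L1_L2, "\n")
```

With these sets JC(TG,L1) = 0.4 and JC(L1,L2) = 0.75, so the L1 – L2 segment shares a larger fraction of its annotations than the TG – L1 segment.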
[Figure: miRNA or siRNA -> protein (mRNA is suppressed) -> protein (TF) -> protein, with JC(TG,L1) and JC(L1,L2) marked on the two PPI segments]

ncRNAppi – The protocol of enrichment analysis

The biological process (BP) and molecular function (MF) annotations are taken from Gene Ontology and are used to characterize the path TG – L1 – L2. The JC for the pathway is given by

\[ JC^{ave}_{BP}(TG, L1, L2) = \frac{1}{2}\left[ JC_{BP}(TG, L1) + JC_{BP}(L1, L2) \right] \]

where JC_{BP}(TG, L1) and JC^{ave}_{BP}(TG, L1, L2) denote the JC score of the biological process for the segment TG – L1 and for the TG – L1 – L2 pathway respectively.

ncRNAppi – The protocol of enrichment analysis, p-value

We assign a p-value to every JC calculation, which provides a measure of statistical significance. The p-value is estimated as follows.

- Let N be the total number of BP terms found in GO.
- Assume that TG, L1 and L2 have x, y and z BP annotations respectively.
- Let n1 and n2 be the number of identical BP terms for TG – L1 and L1 – L2 respectively.
- Let p1 and p2 be the probabilities that TG – L1 and L1 – L2 have n1 and n2 common BP (or MF) terms respectively, which are defined as

\[ p_1 = \frac{C^{N}_{n_1}\, C^{N-n_1}_{x-n_1}\, C^{N-x}_{y-n_1}}{C^{N}_{x}\, C^{N}_{y}} \]

and

\[ p_2 = \frac{C^{N}_{n_2}\, C^{N-n_2}_{y-n_2}\, C^{N-y}_{z-n_2}}{C^{N}_{y}\, C^{N}_{z}} \]

where C^{N}_{k} denotes the number of ways to choose k items from N.

[Figure: Venn diagram within the N GO terms, with x - n1 terms unique to TG, n1 terms shared, and y - n1 terms unique to L1]

ncRNAppi – Extension of TarBase targets

Limitations of miRNA target prediction tools

- Many tools are available for miRNA target gene prediction, such as miRanda, TargetScan and RNAhybrid.
- A major problem is that the prediction accuracy remains uncertain. One report indicated that the false positive rate could be as high as 24-39% for miRanda and 22-31% for TargetScan.
- If the miRNA:mRNA targeting part is uncertain, then the 'Level 1' and 'Level 2' protein-protein interaction pathways derived from the PPI database are doubtful.

ncRNAppi – Extension of TarBase targets

miRNA target prediction tool – miRanda

- Mature human miRNA FASTA sequences are downloaded from miRBase (version 13, the latest at the time of this work).
- Then, we predict the possibilities of miRNA binding with OCG and TSG.
- The target prediction tool, miRanda, allows fine-tuning of certain parameters, i.e. MFE threshold, score, shuffle statistics, gap open and gap extension scores.
- We set the MFE threshold to -25 kcal/mol and turn the shuffle statistics ON. The rest of the parameters are set to their default values.
- Once the binding lists of OCG and TSG are obtained, their PPI pathways can be retrieved from the BioIR database.

Results – ncRNAppi

ncRNAppi provides web-based data access and allows disease assignment for a specific node along miRNA (siRNA) targeting pathways.

For example:
- Select miRNA ID – hsa-let-7
- Check the 'OMIM Disease type for individual node' boxes labeled 'Target' and 'Level-2'
- Choose the item 'lung tumor' under the 'TUMOR TYPE' pull-down menu (OMIM)
- Select 'Yes' under 'Common expression of target, Level-1 and level-2 nodes in KEGG'
- Pathways are ranked according to the Jaccard index and p-value for BP or MF

Example
1) hsa-let7
2) Unigene: liver
3) Target, L1 and L2 are OCG
4) submit

Summary

- R and Bioconductor are used to predict DEGs from prostate cancer microarray data.
- By integrating the Tumor Associated Gene (TAG), ncRNAppi and miR2Disease databases, it is found that certain DEGs are regulated by microRNAs.

A platform for studying miRNAs and cancer target genes
(1) PCC and SRC results are used to quantify the correlation between the expression profiles of a miRNA and its target. The predicted results are annotated with reference to the TAG, OMIM, miR2Disease and KEGG data sets.
(2) The main advantage of the two platforms' miRNA-mRNA targeting information is that all the target gene information and disease records are experimentally verified.

ncRNAppi platform
- ncRNAppi provides a powerful tool for identifying cancer-related miRNAs or siRNAs. For instance, the tool makes it possible to predict novel cancer genes through tissue-specific or disease-specific searches.
- This platform is useful for investigating the regulatory role of miRNAs and siRNAs in cancer studies.

Acknowledgement

- National Science Foundation
- Professor S.C. Lee (李尚熾) – Chung Shan Medical University
- Mr. Liu Hsueh-Chuan (劉學銓) – former graduate student at Asia University
- Mr. C.W. Weng (翁嘉偉) – former graduate student at Asia University
- Mr. Kevin Lo (羅琮傑) – MSc. graduate student at Asia University

Thank you for your attention.