miRNA

In silico study of cancer-related genes and microRNAs
(Using microarrays to screen cancer genes and investigate their upstream regulatory microRNAs)
Ka-Lok Ng (吳家樂)
Department of Biomedical Informatics
(生物與醫學資訊學系)
Asia University
Contents
(1) Introduction – microarray, cancer, microRNA
(2) Methods – input data
(3) Results
    (a) cancer genes prediction (Bioconductor), i.e. prostate/breast cancer
    (b) correlation study of miRNAs and mRNA expression levels
    (c) ncRNAppi – a platform for studying microRNAs and their target genes' protein-protein interactions
(4) Summary
Motivation
- Predict cancer genes based on microarray mRNA expression levels
- microRNA (miRNA) can act as an oncogene (OCG) or tumor suppressor gene (TSG)
- Identify cancer-related miRNAs, their target genes, and downstream protein-protein interactions (prediction of novel cancerous proteins)
Central dogma of molecular biology
[Figure: the central dogma, with post-transcriptional regulation in which microRNA targets mRNA (transcriptome)]
Introduction
Types of RNAs
- mRNA
- rRNA (ribosomal RNA): participates in protein synthesis
- tRNA (transfer RNA): interface between mRNA and amino acids
- ncRNA (non-coding RNA): transcribed RNA with a structural, functional or catalytic role
- snRNA (small nuclear RNA): includes RNA that forms part of the spliceosome
- snoRNA (small nucleolar RNA): found in the nucleolus, involved in modification of rRNA
- miRNA (microRNA): small RNA involved in regulation of expression
- stRNA (small temporal RNA): RNA with a role in developmental timing
- siRNA (small interfering RNA): active molecules in RNA interference
- Other: including large RNA with roles in chromatin structure and imprinting
Cancer formation and ...
Summary of the top ten causes of cancer death in Taiwan, 2008 (ROC year 97)

Rank   Cause of death (cancer type)   Deaths    Percentage
-      Total                          38,913    100%
1      Lung Cancer                     7,777    20.0%
2      Hepatocellular Carcinoma        7,651    19.7%
3      Colorectal Cancer               4,266    11.0%
4      Female Breast Cancer            1,541     4.0%
5      Gastric Cancer                  2,292     5.9%
6      Oral Cavity Cancer              2,218     5.7%
7      Prostate Cancer                   892     2.3%
8      Cervical Cancer                   710     1.8%
9      Esophageal Cancer               1,433     3.7%
10     Pancreatic Cancer               1,364     3.5%
Microarray – overview
[Figure: probe genes on the array hybridized with target cDNA labeled by Cy5 (red) and cDNA labeled by Cy3 (green). By Hanne Jarmer, BioCentrum-DTU, Technical University of Denmark]
cDNA microarrays
- Microarrays are used to measure gene expression levels in two different conditions, with a green label for the control sample and a red label for the experimental sample.
- DNA-cDNA or DNA-mRNA hybridization.
- The hybridized microarray is excited by a laser and scanned at the appropriate wavelengths for the red and green dyes.
- The amount of fluorescence emitted (intensity) upon laser excitation ~ the amount of mRNA bound to each spot.
- If the transcript is more abundant in the control/experimental sample → the spot is green/red, which indicates the relative amount of transcript for the mRNA (EST) in the samples.
- If both are equal → yellow
- If neither is present → black
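The color interpretation above can be sketched as a log-ratio calculation. The R snippet below is only an illustration, assuming background-corrected Cy5 and Cy3 intensities; the intensity values, the `call_color` helper, and the twofold and minimum-signal thresholds are hypothetical and do not come from the slides.

```r
# Hypothetical background-corrected spot intensities (not real data).
cy5 <- c(geneA = 8000, geneB = 500,  geneC = 2000, geneD = 20)  # experimental (red)
cy3 <- c(geneA = 1000, geneB = 4000, geneC = 2100, geneD = 25)  # control (green)

# log2 ratio: positive means more transcript in the experimental sample.
log_ratio <- log2(cy5 / cy3)

# Illustrative color call; both thresholds are assumptions made for this sketch.
call_color <- function(red, green, fold = 1, min_signal = 100) {
  m <- log2(red / green)
  ifelse(pmax(red, green) < min_signal, "black",
         ifelse(m >= fold, "red",
                ifelse(m <= -fold, "green", "yellow")))
}

spot_color <- call_color(cy5, cy3)
print(round(log_ratio, 2))
print(spot_color)
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, which is why the same `fold` value is used for both the red and the green call.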
Microarray data generation, processing and analysis
Image analysis
- Image quantitation – locating the spots and measuring their fluorescence intensities
Information processing
- Data normalization and integration – construction of the gene expression matrix from sets of spot intensities
Data analysis
- Gene expression data analysis and mining – finding differentially expressed genes (DEGs) or clusters of similarly expressed genes
- Generation from these analyses of new hypotheses about the underlying biological processes → these hypotheses should in turn be tested in follow-up experiments
[Figure: clustering of gene expression data. Source: http://www.mathworks.com/company/pressroom/image_library/biotech.html]
Introduction – biogenesis of microRNA
miRNA gene
→ pri-miRNA (stem-loop structure), processed by Drosha
→ pre-miRNA (65~90 nt), carried by Exportin 5 to the cytoplasm
→ mature miRNA (20~25 nt), generated by the RNase III type enzyme Dicer
→ directed by RISC to the miRNA target
→ mRNA cleavage, or its translation into protein is impeded
Introduction - miRNAs can play the role of an OCG and TSG
- When a miRNA plays an oncogenic role, it targets TSGs or genes that control cell differentiation or apoptosis, and leads to tumor formation.
- If a miRNA plays the tumor suppressor role, it targets OCGs or genes that control cell differentiation or apoptosis, so it can suppress tumor formation.
- Expect negative correlation of miRNA and mRNA expression profiles.
- Integrate the human miRNA-targeted (or siRNA-targeted) mRNA data, protein-protein interaction (PPI) records, tissues, pathways, and disease information to establish a disease-related miRNA (or siRNA) pathway database.
Introduction – cancer-related miRNAs

Cancer-related miRNA                                  Cancer type     References
miR-17-92 cluster, let-7                              Lung cancer     Martin et al., 2006; Yanaihara et al., 2006; Takamizawa et al., 2004
miR-10b, miR-21, miR-125b, miR-145, miR-155           Breast cancer   Iorio et al., 2005; Si et al., 2007
miR-18, miR-122a, miR-224, miR-199a, miR-199a*,       Liver cancer    Murakami et al., 2006; Meng et al., 2007; Gramantieri et al., 2007
  miR-195, miR-125a, miR-200a
miR-15, miR-16                                        B-CLL           Calin et al., 2004; Calin et al., 2002
A platform for studying miRNAs and cancerous target genes
miRNA-mRNA anti-correlation pairs are built from two inputs:
- TarBase data → experimentally verified miRNA-mRNA pairs
- NCI-60 cancer data: expression profiles of miRNA and mRNA
miRNA annotation:
- miR2Disease – disease-related miRNA
- Chromosomal fragile sites
- miRNA cluster information
- CpG island proximal miRNA
mRNA annotation:
- TAG → known OCG, TSG or CRG
- OMIM → disease genes
- KEGG → cancer pathways
Number of cell lines for the nine cancer types in the NCI-60 data sets

Cancer type   No. of cell lines
Breast        5
CNS           6
Colon         7
Lung          9
Leukemia      6
Melanoma      10
Ovarian       7
Prostate      2
Renal         8
miRNA, target gene, protein-protein interaction (PPI)
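[Figure: pathway miRNA (or siRNA) – TG – L1 – L2. TG, L1 and L2 carry x, y and z BP/MF annotations; n1 and n2 are the numbers of overlapping BP/MF terms for TG – L1 and L1 – L2. Node labels: protein (mRNA is suppressed), protein (TF), protein]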
- Tissue-specific miRNA or siRNA target, and its PPI partners up to the second level.
- If the upstream miRNA (or siRNA) is defective, its effect could be amplified downstream.
- As an illustration, suppose that a miRNA (or siRNA) targets gene TG, which has two successive PPI partners, proteins L1 and L2. If genes TG and L2 are involved in the same disease, then it is highly probable that gene L1 is also related to that disease → quantified by enrichment analysis.
Input data and Methods
Databases:
- ArrayExpress – raw data files of 64 prostate cancer tissue samples and 18 normal prostate tissue samples, measured with the U95Av2 chip
- TAG (Tumor Associated Gene)
- NCI-60 – miRNA and mRNA gene expression profiles for 9 cancer types
- TarBase – miRNA targets (experimentally verified)
- miR2Disease – a comprehensive resource of miRNA deregulation in various human diseases
- OMIM – human disease information
- KEGG – cancer pathways information
- ncRNAppi – a useful tool for identifying ncRNA target pathways
- PPI data (BioIR) – seven databases are integrated: HPRD, DIP, BIND, IntAct, MIPS, MINT and BioGRID
- Gene Ontology (GO) – Biological Process and Molecular Function annotations
Tool: Bioconductor
Research Protocol
Predict DEGs using R and Bioconductor commands
Enter the following commands in the R environment:

# Read the CEL files in the working directory and compute RMA expression values.
library("affy")
library("limma")
eset <- justRMA()

# Design matrix: assumes the first 18 arrays are normal and the next 64 are tumor.
design <- cbind(normal = c(rep(1, 18), rep(0, 64)),
                DM     = c(rep(0, 18), rep(1, 64)))
fit <- lmFit(eset, design)
cont.matrix <- makeContrasts(DMvsNo = DM - normal, levels = design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)

# Top 100 probes with Benjamini-Hochberg adjusted p-values (computed once, reused below).
top <- topTable(fit2, number = 100, adjust.method = "BH")
top
# Probe IDs: older limma versions return them in top$ID, newer ones as row names.
genenames <- as.character(if (is.null(top$ID)) rownames(top) else top$ID)
adj.P_Val <- signif(top$adj.P.Val, digits = 3)
logFC <- signif(top$logFC, digits = 3)

# Annotate the probes and write the HTML report.
library("XML")
annotation(eset)
library("annotate")
library("hgu95av2.db")
absts <- pm.getabst(genenames, "hgu95av2.db")
library("annaffy")
atab <- aafTableAnn(genenames, "hgu95av2.db", aaf.handler())
stattable <- aafTable("logFC" = logFC, "adj_P.Val" = adj.P_Val)
report <- merge(atab, stattable)
saveHTML(report, "report.html", title = "Significant gene list and its annotation information")
Results – DEGs predicted by Bioconductor



- The top 100 DEGs (either up- or down-regulated) are reported.
- After eliminating duplicated genes, the predicted total number of DEGs is 85, and the adjusted p-values of all DEGs are less than 1.9 × 10^-5.
- TAG ∩ DEGs → 14 known cancer genes among the 85 predicted DEGs (16.5%)
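The de-duplication and TAG overlap steps can be sketched in base R as follows. This is a toy illustration only: the gene symbols and the TAG subset are placeholders, not the actual 85 DEGs or the TAG database content.

```r
# Placeholder gene symbols for the top probes (several probes can map to one gene).
deg_symbols <- c("GENE1", "GENE2", "GENE2", "GENE3", "GENE4", "GENE4", "GENE5")
# Placeholder subset of TAG entries (known OCG, TSG or CRG).
tag_genes <- c("GENE2", "GENE5", "GENE9")

# Eliminate duplicated genes, then intersect with TAG.
unique_degs <- unique(deg_symbols)
known_cancer_degs <- intersect(unique_degs, tag_genes)

# Fraction of predicted DEGs that are already known cancer genes.
fraction_known <- length(known_cancer_degs) / length(unique_degs)
print(known_cancer_degs)
print(fraction_known)
```

With the real data the same calculation gives 14 / 85, which is the 16.5% quoted above.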
Results – miRNAs, DEGs and cancer types
[Figure: miRNAs, DEGs and cancer types; legend entries: DEGs, Other]
Results - The relationship among miR-20a,
TGFBR2 and human prostate cancer
PMID: 16461460
http://ppi.bioinfo.asia.edu.tw/R_cancer/
A platform for studying miRNAs and cancerous
target genes
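For a given cancer tissue type, we calculated both the Pearson correlation coefficient (PCC) and the Spearman rank correlation coefficient (SRC) between the expression profiles of a miRNA and its target gene. The PCC, r, is given by
$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
where x_i and y_i denote the expression intensities of the miRNA and of the miRNA's target gene respectively, and n is the number of paired measurements.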
One problem with quantifying the strength of correlation by the PCC is that it is easily skewed by outliers. A single outlying data point can make two genes appear to be correlated even when all the other data points are not. The SRC is a non-parametric statistical method that is robust to outliers.
The PCC and SRC are calculated for:
Three Affymetrix chips: U95(A-E), U133A, U133B
Normalization methods: GCRMA, MAS5, RMA
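The outlier sensitivity described above can be reproduced with base R's `cor()`. The sketch below uses made-up expression values (eight uncorrelated points plus one extreme point), not NCI-60 data; the `is_anticorrelated` helper is hypothetical and applies the rule of requiring both coefficients to be at most -0.5.

```r
# Made-up expression values: eight uncorrelated points plus one extreme outlier.
mirna  <- c(1, 2, 3, 4, 5, 6, 7, 8, 40)
target <- c(5, 3, 6, 4, 5, 3, 6, 4, -40)

# Pearson works on the raw values, Spearman on their ranks.
pcc <- cor(mirna, target, method = "pearson")
src <- cor(mirna, target, method = "spearman")

# Hypothetical helper: keep a pair only if both coefficients are <= the cutoff.
is_anticorrelated <- function(p, s, cutoff = -0.5) {
  p <= cutoff && s <= cutoff
}

print(c(PCC = pcc, SRC = src))
print(is_anticorrelated(pcc, src))
```

Here the single outlier drives the PCC close to -1 while the SRC stays weak, so requiring both coefficients to pass the cutoff rejects the pair.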
Test of hypothesis of PCC and SRC
The Pearson product-moment table is used to test the significance of a PCC result. The hypothesis being tested is one-tailed. A different test is applied for the SRC results.
[Table: critical values for the one-tailed test using Pearson and Spearman correlation at significance levels α = 0.05 and 0.10]
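As an alternative to looking up critical values in a table, the same one-tailed hypothesis can be checked with `cor.test()` from base R, which reports a p-value directly. This swaps in a different procedure from the table lookup described on the slide, and the expression values below are invented for illustration.

```r
# Invented miRNA and target-gene expression values across nine cell lines.
mirna  <- c(2.1, 3.4, 1.8, 4.9, 5.6, 3.9, 6.3, 7.0, 4.4)
target <- c(8.7, 7.1, 9.0, 5.2, 4.8, 6.5, 3.9, 2.7, 6.0)

# One-tailed tests: H0 "no correlation" against H1 "correlation is negative".
pearson_test  <- cor.test(mirna, target, alternative = "less", method = "pearson")
spearman_test <- cor.test(mirna, target, alternative = "less", method = "spearman")

alpha <- 0.05
significant <- c(PCC = pearson_test$p.value < alpha,
                 SRC = spearman_test$p.value < alpha)
print(significant)
```

Setting `alternative = "less"` is what makes the test one-tailed in the direction expected for a miRNA that suppresses its target.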
Results – hsa-miR-1:AXL, PCC and SRC calculations
Cases where both PCC and SRC are less than or equal to -0.5.
Results – hsa-miR-10b:HOXD10
Another example:
hsa-miR-21:PTEN (TSG)
hsa-miR-15b: BCL2 (TSG)
hsa-miR-16: BCL2 (TSG)
miR2Disease - hsa-miR-10b-related diseases, i.e. leukemia and breast, colon, ovarian and prostate cancers.
Extension - works in progress
- Validate how good the correlation prediction is
- Add further information
  – CpG islands: miRNAs located around CpG islands (i.e., miR-34b, miR-137, miR-193a, and miR-203) are silenced by DNA hypermethylation in oral cancer
  – miRNA clusters, fragile sites
- Positively correlated miRNA:mRNA pairs may involve TFs
ncRNAppi – miRNA, target genes, PPI, and the protocol of enrichment analysis
Two directly interacting proteins tend to participate in the same biological process or share the same molecular function. Let a miRNA targeting pathway be denoted by miRNA – TG – L1 – L2. We propose to rank the pathways according to the number of overlapping biological processes (or molecular functions) between TG and L1, and between L1 and L2. The Jaccard coefficient (JC) is used to rank the significance of a pathway.
The JC of sets A and B is defined by
$$ JC = \frac{|A \cap B|}{|A \cup B|} $$
where |A ∩ B| and |A ∪ B| denote the cardinalities of A ∩ B and A ∪ B respectively.
[Figure: miRNA (or siRNA) – TG – L1 – L2 pathway, with JC(TG,L1) and JC(L1,L2) marked on the TG – L1 and L1 – L2 segments. Node labels: protein (mRNA is suppressed), protein (TF), protein]
ncRNAppi – The protocol of enrichment analysis
The biological process (BP) and molecular function (MF) annotations are taken from Gene Ontology and are used to characterize the path TG – L1 – L2. The JC for the pathway is given by
$$ JC_{BP}^{ave}(TG, L1, L2) = \frac{1}{2}\left[ JC_{BP}(TG, L1) + JC_{BP}(L1, L2) \right] $$
where JC_BP(TG, L1) and JC_BP^ave(TG, L1, L2) denote the JC scores of the biological process for the segment TG – L1 and for the TG – L1 – L2 pathway respectively.
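A minimal R sketch of the JC and the averaged pathway score follows. The `jaccard` helper and the GO term assignments for TG, L1 and L2 are hypothetical placeholders for illustration, and returning 0 for two empty annotation sets is an assumption made here, not something specified in the slides.

```r
# Jaccard coefficient of two annotation sets.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  all_terms <- union(a, b)
  if (length(all_terms) == 0) return(0)  # assumed convention for two empty sets
  length(intersect(a, b)) / length(all_terms)
}

# Placeholder BP annotations for the three nodes (assignments are invented).
bp_TG <- c("GO:0006915", "GO:0008283", "GO:0007165")
bp_L1 <- c("GO:0006915", "GO:0007165", "GO:0006355", "GO:0045944")
bp_L2 <- c("GO:0006355", "GO:0045944")

jc_TG_L1 <- jaccard(bp_TG, bp_L1)  # 2 shared terms out of 5 distinct terms
jc_L1_L2 <- jaccard(bp_L1, bp_L2)  # 2 shared terms out of 4 distinct terms

# Averaged pathway score for TG - L1 - L2.
jc_ave <- (jc_TG_L1 + jc_L1_L2) / 2
print(c(TG_L1 = jc_TG_L1, L1_L2 = jc_L1_L2, average = jc_ave))
```

The same function can be applied to MF annotations to obtain the corresponding MF score.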
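ncRNAppi – The protocol of enrichment analysis, p-value
We assigned a p-value to every JC calculation to provide a measure of statistical significance. The p-value is estimated as follows. Let N be the total number of BP terms found in GO. Assume that TG, L1 and L2 have x, y and z BP annotations respectively, and let n1 and n2 be the numbers of identical BP terms for TG – L1 and L1 – L2 respectively. Let p1 and p2 be the probabilities that TG – L1 and L1 – L2 have n1 and n2 common BP (or MF) terms respectively, which are defined as
$$ p_1 = \frac{C^{N}_{n_1}\, C^{N-n_1}_{x-n_1}\, C^{N-x}_{y-n_1}}{C^{N}_{x}\, C^{N}_{y}} $$
and
$$ p_2 = \frac{C^{N}_{n_2}\, C^{N-n_2}_{y-n_2}\, C^{N-y}_{z-n_2}}{C^{N}_{y}\, C^{N}_{z}} $$
where C^{a}_{b} denotes the binomial coefficient, i.e. the number of ways to choose b items from a.
[Figure: Venn diagram of the BP terms of TG and L1 within the N GO terms: x − n1 terms unique to TG, n1 shared terms, y − n1 terms unique to L1]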
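The probability formula can be evaluated directly with `choose()` in base R. The sketch below uses small, made-up counts so that the result can be checked by hand; `overlap_prob` is a hypothetical helper name, not part of ncRNAppi.

```r
# Probability that two proteins with x and y annotation terms share exactly n
# terms, out of N terms in total, following the p1 formula.
overlap_prob <- function(N, x, y, n) {
  choose(N, n) * choose(N - n, x - n) * choose(N - x, y - n) /
    (choose(N, x) * choose(N, y))
}

# Made-up counts, kept small so the arithmetic is easy to verify.
N <- 10  # total number of terms
x <- 3   # terms annotated to TG
y <- 4   # terms annotated to L1

# Probability of each possible overlap size n1 = 0, 1, 2, 3.
p_by_overlap <- sapply(0:min(x, y), function(n) overlap_prob(N, x, y, n))
print(p_by_overlap)
```

Algebraically this is the hypergeometric probability, so `dhyper(n, x, N - x, y)` gives the same value and is numerically safer when N is in the thousands, as it is for the full set of GO terms.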
ncRNAppi – Extension of TarBase targets
Limitations of miRNA target prediction tools
- Many tools are available for predicting miRNA target genes, such as miRanda, TargetScan and RNAhybrid.
- A major problem of miRNA target gene prediction is that the prediction accuracy remains uncertain. One report indicated that the false positive rate could be as high as 24-39% for miRanda and 22-31% for TargetScan.
- If the miRNA:mRNA targeting part is uncertain, then the ‘Level 1’ and ‘Level 2’ protein-protein interaction pathways derived from the PPI database are doubtful.
ncRNAppi – Extension of TarBase targets
miRNA target prediction tool – miRanda
- Mature human miRNA FASTA sequences are downloaded from miRBase (the latest version is 13).
- We then predict the possibility of miRNA binding to OCGs and TSGs.
- The target prediction tool, miRanda, allows fine-tuning of certain parameters, i.e. MFE threshold, score, shuffle statistics, gap open and gap extension scores.
- We set the MFE threshold to -25 kcal/mol and the shuffle statistics to ON.
- The rest of the parameters are set to their default values.
- Once the binding lists of OCGs and TSGs are obtained, their PPI pathways can be retrieved from the BioIR database.
Results - ncRNAppi
ncRNAppi provides web-based data access and allows disease assignment for a specific node along miRNA (siRNA) targeting pathways. For example:
- Select the miRNA ID hsa-let-7
- Check the ‘OMIM Disease type for individual node’ boxes labeled ‘Target’ and ‘Level-2’
- Choose the item ‘lung tumor’ under the ‘TUMOR TYPE’ pull-down menu (OMIM)
- Select ‘Yes’ under ‘Common expression of target, Level-1 and Level-2 nodes in KEGG’
- The pathways are ranked according to the Jaccard index and p-value for BP or MF
Example
1) hsa-let-7
2) Unigene: liver
3) Target, L1 and L2 are OCG
4) Submit
Summary
R and Bioconductor are used to predict DEGs from prostate cancer microarray data. By integrating the Tumor Associated Gene (TAG), ncRNAppi and miR2Disease databases, it is found that certain DEGs are regulated by microRNAs.
A platform for studying miRNAs and cancer target genes
(1) PCC and SRC results are used to quantify the correlation between the expression profiles of a miRNA and its target. The predicted results are annotated with reference to the TAG, OMIM, miR2Disease and KEGG data sets.
(2) The main advantage of the two platforms is that all the miRNA-mRNA target gene information and disease records are experimentally verified.
ncRNAppi platform
ncRNAppi provides a powerful tool for identifying cancer-related miRNAs or siRNAs. For instance, the tool makes it possible to predict novel cancer genes through tissue-specific or disease-specific searches. This platform is useful for investigating the regulatory role of miRNAs and siRNAs in cancer studies.
Acknowledgement
National Science Foundation
Professor S.C. Lee (李尚熾) - Chung Shan Medical University
Mr. Liu Hsueh-Chuan (劉學銓) – former graduate student at Asia University
Mr. C.W. Weng (翁嘉偉) – former graduate student at Asia University
Mr. Kevin Lo (羅琮傑) – MSc. graduate student at Asia University
Thank you for your attention.