Statistical analysis of DNA microarray data

Gene Co-expression Network
Analysis
BMI 730
Kun Huang
Department of Biomedical Informatics
Ohio State University
Announcement
• No class this Wed
• Change of schedule – miRNA lecture moved to a
later time
• More time for project – only the last class is used
for presentation
• Today
– lecture more relevant to the projects
– Discuss possible class projects
– Decide on the groups
• Decide on the project topic by next Monday –
meeting with me later this week is recommended.
http://www.rithme.eu/img/storage_cost.gif
Gene Expression Microarray
Gene Networks/Pathways
• Regulatory network
• Metabolic pathways
• Signaling pathways
• Protein-protein interaction networks
• Gene interaction networks
• Co-expression network
Networks/Pathways Resources
• www.pathguide.org
• KEGG
• HPRD
• MIMI
• BIND
• …
Networks/Pathways in Research
• Genes don’t act alone
• One gene – one disease model is not
sufficient
• Need to understand how genes
coordinate and work together as a
system
Networks/Pathways
• How to build the network?
• Manual curation – e.g., IPA
• Automatic inference from literature – e.g.,
NLP based method
• Inference from data – e.g., co-expression
network
• Integration from multiple resources –
e.g., STRING database
(http://string.embl.de/)
Networks/Pathways
• How to use the network?
• Functional inference
• Identify new candidates for further
investigation
• Dynamical simulation
• Other types of inferences
MicroRNA (miRNA)
[Figure, panels a-c: Myc, E2F1, E2F2, E2F3 and the mir-17-92 cluster (17-5p, 17-3p, 18a, 19a, 20a, 19b, 92-1)]
Reviewed by: Coller et al. (2007), PLoS Genet 3(8): e146
Figures from Dr. Baltz Agula
Gene Co-Expression
HMMR siRNA
Gene Co-Expression Network
• Expansion
– Negative correlation
– Multiple breast cancer datasets
– More anchor genes
–…
• Is there a way to find all highly correlated
genes in multiple datasets?
• Do these genes form clusters?
Gene Co-Expression Network
• Step 1: Compute pairwise PCC values
• Step 2: Weighted or unweighted?
– Unweighted – need to select a cutoff on PCC
– Weighted – need to consider transformation
of the data
– Keep the scale-free topology
• Step 3: Identify “dense” networks
(subgraphs) from the overall graph
– Hierarchical clustering
– Graph mining
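The first two steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the lecture: the helper names `pearson` and `coexpression_edges`, the toy expression values, and the soft-threshold exponent `beta` (one possible transformation for the weighted case) are all hypothetical. The default cutoff of 0.75 is borrowed from the data-selection slide later in the deck.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profile: treated as uncorrelated (a convention chosen here)
    return cov / (sx * sy)

def coexpression_edges(expr, cutoff=0.75, beta=None):
    """Steps 1 and 2: build {(gene_a, gene_b): weight} from expression profiles.

    expr maps a gene name to its expression values across samples.
    beta=None   -> unweighted: keep an edge (weight 1.0) when PCC > cutoff.
    beta=number -> weighted: every pair gets weight |PCC| ** beta
                   (an assumed soft-threshold transformation).
    """
    genes = sorted(expr)
    edges = {}
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            r = pearson(expr[a], expr[b])
            if beta is None:
                if r > cutoff:
                    edges[(a, b)] = 1.0
            else:
                edges[(a, b)] = abs(r) ** beta
    return edges

# Toy data (made up): g1 and g2 rise together, g3 falls, g4 is unrelated.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "g3": [5.0, 4.1, 2.9, 2.2, 1.0],
    "g4": [1.0, 3.0, 2.0, 3.0, 1.0],
}
unweighted = coexpression_edges(toy, cutoff=0.75)
weighted = coexpression_edges(toy, beta=6)
print(sorted(unweighted))
```

With the signed cutoff, strongly anti-correlated pairs such as g1 and g3 are dropped; comparing `abs(r)` against the cutoff instead would keep them, which is one way to handle the negative-correlation expansion mentioned earlier.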
Graph Mining
• Definition of “dense”
– Ratio of connectivity: for a subgraph with K nodes
and L edges
r = L/(K(K-1)/2).
– K-core: a subgraph in which every node is
connected to at least K other nodes (within this
subgraph).
• Identification of all the “dense” networks
is usually an NP-complete problem.
– Heuristic or approximate algorithms are used – e.g.,
greedy algorithm
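Both definitions of "dense" are easy to compute for a given subgraph; the hard part is searching over all subgraphs. The sketch below assumes an undirected graph given as a list of node pairs, each listed once, and uses hypothetical function names and toy edges. The K-core is found by repeatedly peeling off low-degree nodes, a simple greedy procedure.

```python
def density(nodes, edges):
    """Connectivity ratio r = L / (K(K-1)/2) of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    k = len(nodes)
    if k < 2:
        return 0.0
    # L: edges with both endpoints inside the subgraph
    l = sum(1 for a, b in edges if a in nodes and b in nodes)
    return l / (k * (k - 1) / 2)

def k_core(edges, k):
    """Node set of the k-core: repeatedly drop nodes with fewer than k neighbors."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    alive = set(neighbors)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            # degree is counted only among nodes that are still in the subgraph
            if len(neighbors[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive

# Toy graph (made up): a triangle a-b-c with a tail c-d-e.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
triangle_density = density({"a", "b", "c"}, toy_edges)
whole_density = density({"a", "b", "c", "d", "e"}, toy_edges)
core2 = k_core(toy_edges, 2)
print(triangle_density, whole_density, sorted(core2))
```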
Frequent network mining
• CODENSE
– Originally applied to yeast microarray data,
later expanded to cancers
– Used for functional annotation
Data selection and correlation
• Selected 23 datasets from Gene Expression
Omnibus (GEO)
– Search term “human metastatic cancer”
– Contain both control and tumor, # sample > 8
– Only primary biopsy
• Correlation – PCC > 0.75 (really high similarity)
• For CODENSE
– Edge support in at least 4 datasets
– Connectivity ratio r > 40% (L > 0.4∙n(n-1)/2 for a network with n nodes and L edges)
– # of nodes > 20
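The edge-support criterion can be illustrated on its own: count, for every gene pair, how many datasets contain that co-expression edge, and keep the pairs seen in at least 4. This is only a sketch of that one filter, not the CODENSE algorithm itself; the function name `frequent_edges` and the toy edge lists are hypothetical.

```python
from collections import Counter

def frequent_edges(edge_sets, min_support=4):
    """Return {edge: support} for edges present in at least `min_support` datasets.

    edge_sets holds one iterable of (gene_a, gene_b) pairs per dataset,
    e.g. the pairs whose PCC exceeded the cutoff in that dataset.
    """
    support = Counter()
    for edges in edge_sets:
        # Sort each pair so (a, b) and (b, a) count as the same undirected edge,
        # and use a set so an edge counts at most once per dataset.
        support.update({tuple(sorted(e)) for e in edges})
    return {e: c for e, c in support.items() if c >= min_support}

# Toy input (made up): five datasets, gene names are placeholders.
datasets = [
    [("g1", "g2"), ("g2", "g3")],
    [("g2", "g1"), ("g3", "g4")],
    [("g1", "g2"), ("g2", "g3")],
    [("g1", "g2")],
    [("g2", "g3"), ("g1", "g3")],
]
kept = frequent_edges(datasets, min_support=4)
print(kept)
```

The surviving edges form a summary graph on which the connectivity-ratio and size filters would then be applied.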
Results from CODENSE
• 44 networks are identified
• # of nodes: 21 ~ 74 (average 44)
• Connectivity: 0.41 ~ 0.78
Finding New Functions
Relation to BRCA1
Comparing ER- and ER+
breast cancer patients
• Estrogen receptor status is one of the key
biomarkers for breast cancer prognosis
(ER- indicates poor prognosis)
• Select a dataset (GSE2034, Wang et al) from
GEO containing 286 samples (77 ER-, 209
ER+)
• Compare the ER- group vs ER+ group,
select the network that is most perturbed
• The network containing HMMR is most
perturbed – more than half of the genes are
differentially regulated
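One way to score how "perturbed" a network is: test each gene for a difference between the two groups and report the fraction of genes that pass. The slides do not say which test was used, so the sketch below assumes Welch's t statistic with an arbitrary cutoff of |t| > 2; the function names and toy values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se if se > 0 else 0.0

def perturbed_fraction(network_genes, expr_neg, expr_pos, t_cutoff=2.0):
    """Fraction of genes in the network with |t| above the (assumed) cutoff.

    expr_neg / expr_pos map gene name -> expression values in the
    ER- and ER+ groups respectively.
    """
    hits = [g for g in network_genes
            if abs(welch_t(expr_neg[g], expr_pos[g])) > t_cutoff]
    return len(hits) / len(network_genes)

# Toy data (made up): only g1 differs clearly between the groups.
expr_neg = {"g1": [5.0, 5.2, 4.8, 5.1], "g2": [2.0, 2.1, 1.9, 2.0], "g3": [3.0, 3.5, 2.5, 3.2]}
expr_pos = {"g1": [3.0, 3.1, 2.9, 3.2], "g2": [2.0, 1.9, 2.1, 2.0], "g3": [3.1, 2.6, 3.4, 3.0]}
fraction = perturbed_fraction(["g1", "g2", "g3"], expr_neg, expr_pos)
print(fraction)
```

Ranking the candidate networks by this fraction would then pick out the most perturbed one. A real analysis would use p-values with multiple-testing correction rather than a fixed t cutoff.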
Select gene signature from
a network to predict survival
• Use the genes in this network as features to cluster
patients in the Rosetta data (295 breast cancer
patients) and compare the survival between the two
groups.
Log-rank test
p < 1e-8
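For reference, a two-group log-rank test can be written with the standard library alone. The sketch below is a textbook version, not the code behind the result above: at each event time it compares observed and expected events in group A, then refers the statistic to a chi-square distribution with 1 degree of freedom. The function name and the toy survival times are hypothetical.

```python
import math

def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value).

    times_*  : follow-up time per patient
    events_* : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk in B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # hypergeometric variance of the number of events in group A
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / var
    # Upper tail of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Toy data (made up): group A has early events, group B late ones, one censored.
chi2, p = logrank([2, 3, 4, 5, 6], [1, 1, 1, 1, 1],
                  [10, 12, 14, 15, 16], [1, 1, 0, 1, 1])
print(chi2, p)
```

In the workflow described above, the two groups would come from clustering patients on the expression of the network's genes.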
Possible Project Topics:
1. Compare the gene expression profiles between tumor
and its microenvironment – differential expression, gene
co-expression network, and tissue-tissue expression
network.
2. Similarly compare the co-expression network between
different types of tissues.
3. Herpes virus and cancer; predict human gene targets for
virus (Herpes virus) microRNAs.
4. Gene expression “stalling” prediction using “stalling
index” from ChIP-seq data for RNA polymerase II.
5. TF binding motif prediction using graph theoretical
method.
6. MicroRNA co-expression network to predict microRNA
transcription regulation.
7. Your own research problem …