Gene Co-expression Network Analysis
BMI 730
Kun Huang
Department of Biomedical Informatics
Ohio State University

Announcement
• No class this Wed
• Change of schedule – miRNA lecture moved to a later time
• More time for project – only the last class is used for presentation
• Today – lecture more relevant to the projects
  – Discuss possible class projects
  – Decide on the groups
• Decide on the project topic by next Monday – meeting with me later this week is recommended.

[Figure: http://www.rithme.eu/img/storage_cost.gif]

Gene Expression Microarray

Gene Networks/Pathways
• Regulatory network
• Metabolic pathways
• Signaling pathways
• Protein-protein interaction networks
• Gene interaction networks
• Co-expression network

Networks/Pathways Resources
• www.pathguide.org
• KEGG
• HPRD
• MIMI
• BIND
• …

Networks/Pathways in Research
• Genes don’t act alone
• One gene – one disease model is not sufficient
• Need to understand how genes coordinate and work together as a system

Networks/Pathways
• How to build the network?
• Manual curation – e.g., IPA
• Automatic inference from literature – e.g., NLP based method
• Inference from data – e.g., co-expression network
• Integration from multiple resources – e.g., STRING database (http://string.embl.de/)

Networks/Pathways
• How to use the network?
• Functional inference
• Identify new candidate for further investigation
• Dynamical simulation
• Other types of inferences

MicroRNA (miRNA)
[Figure, panels a–c: Myc, E2F1, E2F2, E2F3 and the mir-17-92 cluster (17-5p, 17-3p, 18a, 19a, 20a, 19b, 92-1)]
Reviewed by: Coller et al. (2008), PLoS Genet 3(8): e146
Figures from Dr.
Baltz Agula

Gene Co-Expression
[Figure: HMMR siRNA]

Gene Co-Expression Network
• Expansion
  – Negative correlation
  – Multiple breast cancer datasets
  – More anchor genes
  – …
• Is there a way to find all highly correlated genes in multiple datasets?
• Do these genes form clusters?

Gene Co-Expression Network
• Step 1: Compute pairwise PCC (Pearson correlation coefficient) values
• Step 2: Weighted or unweighted?
  – Unweighted – need to select a cutoff on PCC
  – Weighted – need to consider transformation of the data
  – Keep the scale-free topology
• Step 3: Identify “dense” networks (subgraphs) from the overall graph
  – Hierarchical clustering
  – Graph mining

Graph Mining
• Definition of “dense”
  – Ratio of connectivity: for a subgraph with K nodes and L edges, r = L/(K(K-1)/2).
  – K-core: a subgraph in which every node is connected to at least K other nodes (within this subgraph).
• Identification of all the “dense” networks is usually an NP-complete problem.
  – Heuristic or approximate algorithms are used – e.g., greedy algorithm

Frequent network mining
• CODENSE
  – Originally applied to yeast microarray data, later expanded to cancers
  – Used for functional annotation

Data selection and correlation
• Selected 23 datasets from Gene Expression Omnibus (GEO)
  – Search term “human metastatic cancer”
  – Contain both control and tumor samples, # of samples > 8
  – Only primary biopsy
• Correlation
  – PCC > 0.75 (really high similarity)
• For CODENSE
  – Edge support in at least 4 datasets
  – Connectivity ratio r > 40% (L > r∙n(n-1)/2)
  – # of nodes > 20

Results from CODENSE
• 44 networks are identified
• # of nodes: 21 ~ 74 (average 44)
• Connectivity: 0.41 ~ 0.78

Finding New Functions

Relation to BRCA1

Comparing ER- and ER+ breast cancer patients
• Estrogen receptor status is one of the key biomarkers for breast cancer prognosis (ER- indicates poor prognosis)
• Select a dataset (GSE2034, Wang et al) from GEO containing 286 samples (77 ER-, 209 ER+)
• Compare the ER- group vs ER+ group, select the network that is most perturbed
• The network
containing HMMR is most perturbed – more than half of the genes are differentially regulated

Select gene signature from a network to predict survival
• Use the genes in this network as features to cluster patients in the Rosetta data (295 breast cancer patients) into two groups and compare the survival between the two groups.
• Log-rank test p < 1e-8

Possible Project Topics:
1. Compare the gene expression profiles between tumor and its microenvironment – differential expression, gene co-expression network, and tissue-tissue expression network.
2. Similarly compare the co-expression network between different types of tissues.
3. Herpes virus and cancer; predict human gene targets for virus (Herpes virus) microRNAs.
4. Gene expression “stalling” prediction using “stalling index” from ChIP-seq data for RNA polymerase II.
5. TF binding motif prediction using graph theoretical method.
6. MicroRNA co-expression network to predict microRNA transcription regulation.
7. Your own research problem …
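
For projects that build a co-expression network, Steps 1 and 2 (unweighted case) of the Gene Co-Expression Network slide, plus an edge-support filter across datasets, might be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the CODENSE implementation: the function names (pearson, coexpression_edges, frequent_edges) and the input layout (a dict mapping gene name to a list of expression values) are hypothetical, and the defaults 0.75 and 4 are simply the PCC cutoff and edge support quoted in the slides.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    # A constant profile has no defined correlation; treat it as 0 here.
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def coexpression_edges(expr, cutoff=0.75):
    """Unweighted network: gene pairs whose PCC exceeds the cutoff.

    expr maps gene name -> list of expression values (one per sample).
    Each edge is a (gene1, gene2) tuple with gene1 < gene2.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expr), 2):
        if pearson(expr[g1], expr[g2]) > cutoff:
            edges.add((g1, g2))
    return edges


def frequent_edges(datasets, cutoff=0.75, min_support=4):
    """Edges that pass the cutoff in at least min_support datasets."""
    support = {}
    for expr in datasets:
        for edge in coexpression_edges(expr, cutoff):
            support[edge] = support.get(edge, 0) + 1
    return {edge for edge, count in support.items() if count >= min_support}
```

Only positive correlations are kept here; the negative-correlation expansion mentioned in the slides would need a test on the absolute value of the PCC instead. The all-pairs loop is quadratic in the number of genes, so a real dataset would likely call for a vectorized correlation matrix.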
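
The two definitions of “dense” from the Graph Mining slide can also be checked directly on a candidate subgraph. The sketch below assumes edges are stored as a set of node pairs with each pair listed once; connectivity_ratio and k_core are hypothetical helper names. The k-core is found by repeatedly peeling off low-degree nodes, which is a standard procedure for this definition but is not the CODENSE algorithm.

```python
def connectivity_ratio(nodes, edges):
    """r = L / (K(K-1)/2) for the subgraph induced by `nodes`.

    K is the number of nodes and L the number of edges with both
    endpoints inside `nodes`. Each edge must appear only once in `edges`.
    """
    nodes = set(nodes)
    k = len(nodes)
    if k < 2:
        return 0.0
    l = sum(1 for u, v in edges if u in nodes and v in nodes)
    return l / (k * (k - 1) / 2)


def k_core(edges, k):
    """Nodes of the k-core: every remaining node has >= k remaining neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        # Remove every node whose current degree is below k, then re-check,
        # because each removal can lower the degree of the nodes left behind.
        for node in [n for n, nb in neighbors.items() if len(nb) < k]:
            for other in neighbors.pop(node):
                neighbors[other].discard(node)
            changed = True
    return set(neighbors)
```

With the thresholds from the slides, a candidate network would be kept when it has more than 20 nodes and connectivity_ratio returns a value above 0.4.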