Outline • Community detection • Generative model for modular networks • Variational Bayesian inference • Validation • Applications Community detection • Model global structure (e.g. summarize data) • Visualize structure (e.g. graph layout) • Analyze interactions (e.g. affinities within/between groups) • Predict (e.g. function, attributes, links) Image from Giorgini, et. al., Genome Biol, 2005 Community detection: Background • Physics literature • Machine learning literature • Newman et. al. (2002,...) • Nowicki & Snijders (2001) • Bornholdt & Reichardt (2006) • Kemp et. al. (2004) • Hastings (2006) • Leicht & Newman (2007) • ... • Airoldi et. al. (2007) • Xu et. al. (2007) • Sinkkonen et. al. (2007) • Parametrized cost function (energy), mostly focus on how to optimize • Complex models, approximate inference (often expensive) Complexity control Increasing complexity Images from http://research.microsoft.com/~minka/statlearn/demo/ and Giorgini, et. al., Genome Biol, 2005 Community detection as inference • Data D={Aij} i,j=1,...,N; Aij=1 if nodes i and j connected • Parameters bias of die !, bias of coins ! • Latent variables {zi}, assignments of nodes to communities • Given D, infer z, " and K Outline • Community detection • Generative model for modular networks • Variational Bayesian inference • Validation • Applications Community detection as inference Sampling Model (parameters !, latent variables z, complexity K) Data Inference Constrained stochastic block model • Nodes belong to “blocks” of varying size • Roll die for assignment of nodes to blocks • Probability of edge between two nodes depends only on block membership • Flip (one of two) coins for edges 0 10 20 30 40 50 60 70 • Result: mixture of Erdos-Renyi graphs 80 90 100 0 10 Holland, Laskey, Leinhardt 1983; Wang and Wong, 1987 20 30 40 50 60 nz = 1996 70 80 90 100 Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node 1 2 3 4 5 6 7 8 9 • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij Generating modular networks • For each node: • Roll K-sided die with bias ! to determine zi=1,...,K, the (unobserved) module assignment for ith node • For each pair of nodes (i,j): • If zi=zj, flip “in community” coin with bias !c to determine edge Aij • If zi"zj, flip “between communities” coin with bias !d to determine edge Aij 1 2 7 3 5 9 4 6 8 Outline • Community detection • Generative model for modular networks • Variational Bayesian inference • Validation • Applications Community detection as inference Sampling Model (parameters !, latent variables z, complexity K) Data Inference Community detection as inference • From observed graph structure, infer distributions over module assignments, model parameters, and model complexity 1 2 7 3 5 9 4 6 8 Community detection as inference • From observed graph structure, infer distributions over module assignments, model parameters, and model complexity 1 2 7 3 5 9 4 6 8 Multiplication is easy, but normalization is intractable O(KN); use mean-field variational approach Variational Bayes • Jensen’s inequality (log of expected value bounds expected value of log) for any distribution q Variational Bayes (Feynman 1972; MacKay, Jordan, Ghahramani, Jaakola, Saul 1999) Variational Bayes for modular networks • Iteratively optimize F{q;A} by updating distributions over parameters {!, !} and latent variables {z} 34 Algorithm 2 Variational Bayes for maximum evidence inference 1: t=0 2: choose initial distributions q (0) (Z), q (0) (Θ) 3: repeat 4: E-step: calculate ln q (t+1) (Z) ∝ "ln p(D, Z|Θ, K)p(Θ|K)#q(t) (Θ) 5: M-step: calculate ln q (t+1) (Θ) ∝ "ln p(D, Z|Θ, K)p(Θ|K)#q(t+1) (Z) 6: t←t+1 7: until F[q (t+1) (Z), q (t+1) (Θ)] − F[q (t) (Z), q (t) (Θ)] ≤ δ or t = Tmax a local optimum of the variational free energy. To initialize the algorithm, one chooses initial distributions q (0) (Z) and q (0) (Θ). In the E-step, the variational free energy is optimized as a functional of q(Z) for the current parameter distribution, i.e. q (t+1) (Z) = argmax F[q(Z), q (t) (Θ)]. In the M-step, the variational free energy Validation: complexity control • Automatic complexity control: probability of occupation for extraneous modules goes to zero http://vbmod.sourceforge.net Phys. Rev. Lett. 2008 vol. 100 Outline • Community detection • Generative model for modular networks • Variational Bayesian inference • Validation • Applications Validation: “four groups” test • Mutual information between true and inferred latent variable assignments for N=128 nodes, K=4 modules, average node degree 16 Danon et. al (2005) Validation: Complexity control • Comparison of our method (VB) to alternative method (ICL, based on BIC) for synthetic N=60 node networks and KTrue=3,4,5 modules Validation: Large-scale network Validation: Runtime • O(MK) runtime; ~400 sec for N=106 nodes, K=4 modules, average node degree 16 $ !" # !" 3)456*,.72,89 ! !" " !" !! !" !# !" # !" $ !" % !" ()*+,-./0.(/1,2 & !" ' !" Validation: NCAA football schedule • Correctly infer K=12 conferences nodes: teams edges: games shape: conference color: inferred module Validation: NCAA football schedule 43 35 62 100 44 27 32 19 55 86 13 15 72 37 39 60 64 28 113 77 63 71 18 97 76 57 96 45 67 110 14 107 16 106 90 2 49 104 7 101 3 48 33 34 88 40 61 46 87 21 66 93 38 92 114 59 98 58 20 65 26 103 70 91 83 25 12 41 17 53 82 11 94 105 75 6 1 51 108 99 85 50 54 10 115 89 5 24 102 95 4 73 29 30 31 81 56 80 36 42 111 47 84 68 74 22 52 • Correctly infer K=12 conferences 78 23 9 8 69 112 109 79 nodes: teams edges: games shape: conference color: inferred module Application: E. Coli protein-protein network Colicin_E3 kba pys2 Colicin_E7 Cell_division_protein_ftsN Cell_division_protein_ftsW Cell_division_protein_ftsQ Immunity_Protein Colicin_E9_Immunity_Protein YbjP YcdO LivJ DsbB Peptidoglycan_synthetase_ftsI_precursor_(Penicillin-binding_protein_3)_(PBP-3) Nuclease_sbcCD_subunit_D Chemotaxis_protein_methyltransferase Cytochrome_c-type_biogenesis_protein_ccmH_precursor YodA OppA pys2 ATP-dependent_Clp_protease_proteolytic_subunit_(Endopeptidase_Clp)_(Caseinolytic_protease)_(Protease_Ti)_(Heat_shock_protein_F21.5) lamB LivK OstA Cell_division_protein_ftsL DppA GltI DsbA YibQ btuB Transcriptional_activator_cadC Exonuclease_sbcC Ornithine_carbamoyltransferase_chain_I_(OTCase-1) Acriflavine_resistance_protein_A_precursor ADA_regulatory_protein_(Regulatory_protein_of_adaptative_response)_[Contains:_Methylated-DNA-protein-cysteine_methyltransferase_(O-6-methylguanine-DNA_alkyltransferase)] ZnuA OmpA Alpha-galactosidase_(Melibiase) RcsF Acriflavine_resistance_protein_B Cell_division_protein_ftsZ YedD RNA_polymerase_sigma_factor_for_flagellar_operon_(Sigma-F_factor)_(Sigma-27)_(Sigma-28) Aquaporin_Z_(Bacterial_nodulin-like_intrinsic_protein) secB RNA_polymerase_sigma_factor_rpoD_(Sigma-70) Regulator_of_sigma_D Arabinose_operon_regulatory_protein Aerotaxis_receptor major_coat_protein_-_phage_Pf3 Cell_division_protein_ftsK TraV_protein_precursor 60_kDa_inner-membrane_protein rpoE DNA_replication_and_repair_protein_recF Acetyl_esterase Tryptophan_synthase_beta_chain Septum_site-determining_protein_minC Probable_ammonium_transporter TraK_protein Ferredoxin,_2Fe-2S Ferrienterobactin_receptor_precursor_(Enterobactin_outer-membrane_receptor) Fructose-bisphosphate_aldolase_class_II_(FBP_aldolase) rseA N_utilization_substance_protein_A_(nusA_protein)_(L_factor) Recombination_protein_recR Putative_ATP-binding_component_of_a_transport_system parE Septum_site-determining_protein_minD_(Cell_division_inhibitor_minD) Nitrogen_regulatory_protein_P-II_1 L-asparaginase_II_precursor_(L-asparagine_amidohydrolase_II)_(L-ASNase_II)_(Colaspase) Chromosomal_replication_initiator_protein_dnaA 1FIA_B Pyruvate_dehydrogenase_E1_component Ribonuclease_G_(RNase_G)_(Cytoplasmic_axial_filament_protein) UTP-glucose-1-phosphate_uridylyltransferase_(UDP-glucose_pyrophosphorylase)_(UDPGP)_(Alpha-D-glucosyl-1-phosphate_uridylyltransferase)_(Uridine_diphosphoglucose_pyrophosphorylase) ATP-dependent_Clp_protease_ATP-binding_subunit_clpX Hemolysin_secretion_protein_D,_chromosomal Methyl-accepting_chemotaxis_protein_II_(MCP-II)_(Aspartate_chemoreceptor_protein) crp 30S_ribosomal_protein_S12 zipA Hypothetical_39.8_kDa_protein_in_OSMY-DEOC_intergenic_region_(O357) TraB_protein Phosphate_transport_ATP-binding_protein_pstB Cell_Division_Protein_Ftsz Biopolymer_transport_exbD_protein Transcription_antitermination_protein_nusG Cell_division_protein_ftsA mopA yljA Outer_membrane_protein_tolC_precursor Fructose-bisphosphate_aldolase_class_I_(FBP_aldolase) Cell_division_inhibitor MDH Phosphoribosylaminoimidazole-succinocarboxamide_synthase_(SAICAR_synthetase) 50S_ribosomal_protein_L7/L12_(L8) mopA clpA DNA-binding_protein_H-NS_(Histone-like_protein_HLP-II)_(Protein_H1)_(Protein_B1) Threonine_3-dehydrogenase ftnA Polyribonucleotide_nucleotidyltransferase_(Polynucleotide_phosphorylase)_(PNPase) rpoA GroES 50S_ribosomal_protein_L9 Hemolysin_secretion_ATP-binding_protein,_chromosomal Putative_tagatose_6-phosphate_kinase_gatZ Maltose-binding_periplasmic_protein_precursor_(Maltodextrin-binding_protein)_(MMBP) Capsular_synthesis_regulator_component_B flagellar_motor_switch_protein_FliG_-_Escherichia_coli_(strain_K-12) Dna_Gyrase_B Tagatose-bisphosphate_aldolase_gatY Aspartate_Carbamoyltransferase_Catalytic_Chain 8-amino-7-oxononanoate_synthase_(AONS)_(8-amino-7-ketopelargonate_synthase)_(7-keto-8-amino-pelargonic_acid_synthetase)_(7-KAP_synthetase)_(L-alanine-pimelyl_CoA_ligase) ATP_synthase_delta_chain ATP_synthase_epsilon_chain_(ATP_synthase_F1_sector_epsilon_subunit) Phosphomethylpyrimidine_kinase_(HMP-phosphate_kinase)_(HMP-P_kinase) aroG dnaK arginine_decarboxylase_(EC_4.1.1.19) L-ribulose-5-phosphate_4-epimerase_(Phosphoribulose_isomerase)pyrI Phosphate_acetyltransferase_(Phosphotransacetylase) Acetylglutamate_kinase_(NAG_kinase)_(AGK)_(N-acetyl-L-glutamate_5-phosphotransferase) MalT_regulatory_protein aroG Dihydrolipoamide_dehydrogenase_(E3_component_of_pyruvate_and_2-oxoglutarate_dehydrogenases_complexes)_(Glycine_cleavage_system_L_protein) DNA-directed_RNA_polymerase_beta_chain_(Transcriptase_beta_chain)_(RNA_polymerase_beta_subunit) grpE Ribonucleoside-diphosphate_reductase_1_beta_chain_(Ribonucleotide_reductase_1)_(B2_protein)_(R2_protein) major_prion_PrP-Sc_protein_precursor RNA_polymerase_sigma-E_factor_(Sigma-24) GroEL 2-isopropylmalate_synthase_(Alpha-isopropylmalate_synthase)_(Alpha-IPM_synthetase) Sigma-E_factor_regulatory_protein_rseB_precursor RNA_polymerase_sigma-32_factor_(Heat_shock_regulatory_protein_F33.4) D-amino_acid_dehydrogenase_small_subunit Argininosuccinate_synthase_(Citrulline-aspartate_ligase) Ribonuclease_E_(RNase_E) Chaperone_protein_dnaJ_(Heat_shock_protein_J)_(HSP40) Selenocysteine_lyase_(Selenocysteine_reductase)_(Selenocysteine_beta-lyase)_(SCL) pyrI ATP_synthase_B_chain glf S-adenosylmethionine_synthetase_(Methionine_adenosyltransferase)_(AdoMet_synthetase)_(MAT) ECs5222 ATP_synthase_beta_chain tufA Acetolactate_synthase_isozyme_III_large_subunit_(AHAS-III)_(Acetohydroxy-acid_synthase_III_large_subunit)_(ALS-III) Glutamate_decarboxylase_alpha_(GAD-alpha) 50S_ribosomal_protein_L23 Glycyl-tRNA_synthetase_beta_chain_(Glycine-tRNA_ligase_beta_chain)_(GlyRS) 5,10-methylenetetrahydrofolate_reductaseATP_synthase_alpha_chain colicin_E3__chain_A Thiazole_biosynthesis_protein_thiG yigB Hfq_protein_(Host_factor-I_protein)_(HF-I)_(HF-1) Phenylalanyl-tRNA_synthetase_beta_chain_(Phenylalanine-tRNA_ligase_beta_chain)_(PheRS) thrS sspBATP_synthase_A_chain_(Protein_6) pyrI pyrI rho Ssra recA Glyceraldehyde_3-phosphate_dehydrogenase_A_(GAPDH-A) cI DNA_polymerase_III_alpha_subunit Diacylglycerol_kinase_(DAGK)_(Diglyceride_kinase)_(DGK) Probable_RNA_polymerase_sigma_factor_fecI NagD_protein Enolase_(2-phosphoglycerate_dehydratase)_(2-phospho-D-glycerate_hydro-lyase) lexA Hypothetical_protein_cusX_precursor Putative_copper_efflux_system_protein_cusB_precursor Transcriptional_regulatory_protein_dcuR Topoisomerase_IV_subunit_A Preprotein_translocase_secA_subunit Osmolarity_sensor_protein_envZ DNA_replication_protein_dnaC TonB_protein GTP-binding_protein_era Protein_yfhF BirA_bifunctional_protein_[Includes:_Biotin_operon_repressor;_Biotin-[acetyl-CoA-carboxylase]_synthetase_(Biotin-protein_ligase)] Nitrogen_regulation_protein_NR(II) Serine_acetyltransferase_(SAT)Sensor_protein_barA Protein_fecR DinI mucA pyrI Translation_initiation_factor_IF-3 Cryptic_beta-glucoside_bgl_operon_antiterminator Arsenical_resistance_operon_repressor Cys_regulon_transcriptional_activator PTS_system,_beta-glucoside-specific_IIABC_component_(EIIABC-BGL)_(Beta-glucoside-permease_IIABC_component)_(Phosphotransferase_enzyme_II,_ABC_component)_(EII-BGL) UvrY_protein mucB Protease_do_precursor Cyanate_hydratase_(Cyanase)_(Cyanate_lyase)_(Cyanate_hydrolase) Sulfate_adenylyltransferase_subunit_1_(Sulfate_adenylate_transferase)_(SAT)_(ATP-sulfurylase_large_subunit) Transcriptional_repressor_cytR Sensor_protein_evgS_precursor Transcriptional_regulatory_protein_ompR Cytochrome_c-type_biogenesis_protein_ccmF Iron(III)_dicitrate_transport_protein_fecA_precursor Probable_outer_membrane_lipoprotein_cusC_precursor Arginine_repressor Biopolymer_transport_exbB_protein ssb Glutamine_synthetase_(Glutamate-ammonia_ligase) Positive_transcription_regulator_evgA Phospho-2-dehydro-3-deoxyheptonate_aldolase,_Tyr-sensitive_(Phospho-2-keto-3-deoxyheptonate_aldolase)_(DAHP_synthetase)_(3-deoxy-D-arabino-heptulosonate_7-phosphate_synthase) MazG_protein Sulfate_adenylyltransferase_subunit_2_(Sulfate_adenylate_transferase)_(SAT)_(ATP-sulfurylase_small_subunit) DNA-directed_RNA_polymerase_beta’_chain_(Transcriptase_beta’_chain)_(RNA_polymerase_beta’_subunit) Sensor_protein_dcuS DNA_polymerase_III,_epsilon_chain Protease_degQ_precursor Thiol:disulfide_interchange_protein_dsbC_precursor Thiol:disulfide_interchange_protein_dsbD_precursor_(C-type_cytochrome_biogenesis_protein_cycZ)_(Inner_membrane_copper_tolerance_protein) def Cytochrome_c-type_biogenesis_protein_ccmE DNA_polymerase_III,_theta_subunit 50S_ribosomal_protein_L34 Chaperone_protein_hscA_(Hsc66) Thioredoxin_1_(TRX1)_(TRX) http://www.cs.purdue.edu/homes/koyuturk/mawish/ Regulator_of_sigma_D Arabinose_operon_regulatory_protein Aerotaxis Application: E. Coli protein-protein network major_coat_protein_-_phage_Pf3 Cell_division_protein_ftsK TraV_protein_precursor 60_kDa_inner-membrane_protein Tryptophan_synthase_beta_chain Septum_site-determining_protein_minC Probable_ammonium_transporter TraK_protein Putative_ATP-binding_component_of_a_transport_system parE Septum_site-determining_protein_minD_(Cell_division_inhibitor_minD) Nitrogen_regulatory_protein_P-II_1 L-asparaginase_II_precursor_(L-asparagine_amidohydrolase_II)_(L-ASNase_II)_(Colaspase) Ribonuclease_G_(RNase_G)_(Cytoplasmic_axial_filament_protein UTP-glu ATP-dependent_Clp_protease_ATP-binding_subunit_clpX Hemolysin_secretion_protein_D,_chromoso Methyl-accepting_chemotaxis_protein_II_(MCP-II)_(Aspartate_chemoreceptor_protein) 30S_ribosomal_prot zipA Hypothetical_39.8_kDa_protein_in_OSMY-DEOC_intergenic_region_(O357 TraB_protein Phosphate_transport_ATP-binding_protein_pstB Cell_Division_Protein_Ftsz cell division 50S_ribosomal_protein_L7/L DNA-binding_ Polyribonucleotide_nucleo Capsular_sy flagellar_m http://www.cs.purdue.edu/homes/koyuturk/mawish/ Outline • Community detection • Generative model for modular networks • Variational Bayesian inference • Validation • Applications Application: APS March Meeting 2008 co-authorship http://meetings.aps.org/Meeting/MAR08 Conclusions • Phrased community detection as Bayesian inference • Variational methods give interpretable, fast, and scalable method for (approximate) inference • Empirical validation of model selection and comparison for constrained and full stochastic block models • Some correlation between topological communities and node attributes, but unclear without additional information Acknowledgements • Wiggins Lab • Useful discussions • Andrew Mugler • Edo Airoldi (Princeton) • Anil Raj • Joel Bader (John Hopkins) • Chris Wiggins • David Blei (Princeton) • Jon Bronson • Aaron Clauset (SFI) • Yahoo! Research • Duncan Watts • Jonathan Goodman (NYU) • Matt Hastings (LANL)