How Powerful are Graph Neural Networks?
Jure Leskovec, Stanford University

Tasks on Networks
Classical ML tasks in networks:
§ Node classification: predict the type of a given node
§ Link prediction: predict whether two nodes are linked
§ Community detection: identify densely linked clusters of nodes
§ Network similarity: measure how similar two (sub)networks are

ML on Graphs: Node Classification
Many possible ways to create node features by hand:
§ Node degree, PageRank score, motifs, degrees of neighbors, clustering coefficient, …

Deep Learning in Graphs
Input: a network. Predictions: node labels, new links, generated graphs and subgraphs.

Why is it Hard?
The modern deep learning toolbox is designed for simple sequences or grids:
§ CNNs for fixed-size images/grids
§ RNNs or word2vec for text/sequences
But networks are far more complex:
§ Arbitrary size and complex topological structure (i.e., none of the spatial locality of grids or text)
§ No fixed node ordering or reference point
§ Often dynamic, with multimodal features

Setup
We have a graph $G$:
§ $V$ is the vertex set
§ $A$ is the (binary) adjacency matrix
§ $X \in \mathbb{R}^{|V| \times d}$ is a matrix of node features
§ Meaningful node features:
  § Social networks: user profiles
  § Biological networks: gene expression profiles, gene functional information

Graph Neural Networks
Idea: a node's neighborhood defines a computation graph.
(1) Determine the node's computation graph. (2) Propagate and transform information.
Learn how to propagate information across the graph to compute node features.
The Graph Neural Network Model. F. Scarselli, et al. IEEE Transactions on Neural Networks, 2009.
Semi-Supervised Classification with Graph Convolutional Networks. T. N. Kipf, M. Welling. ICLR 2017.

Each node defines a computation graph rooted at itself (e.g., the computation graph of target node A in an example input graph on nodes A-F):
§ Each edge in the computation graph is a transformation/aggregation function
Intuition: nodes aggregate information from their neighbors using neural networks.
Inductive Representation Learning on Large Graphs. W. Hamilton, R. Ying, J. Leskovec. NIPS 2017.

Idea: Aggregate Neighbors
Intuition: the network neighborhood defines a computation graph, so every node defines its own computation graph based on its neighborhood. This can be viewed as learning a generic linear combination of graph low-pass and high-pass operators.

[NIPS '17] Our Approach: GraphSAGE
Each layer computes $\gamma\big(\bigoplus_{u \in N(v)} h_u\big)$, where $\gamma(\cdot)$ is the combine function and $\bigoplus(\cdot)$ is the aggregate function. Update for node $v$:

$h_v^{(\ell+1)} = \sigma\Big(B_\ell\, h_v^{(\ell)} + W_\ell \sum_{u \in N(v)} \frac{h_u^{(\ell)}}{|N(v)|}\Big)$

§ $h_v^{(\ell+1)}$: the level-$(\ell+1)$ embedding of node $v$
§ $B_\ell\, h_v^{(\ell)}$: transforms $v$'s own embedding from level $\ell$
§ $W_\ell \sum_{u \in N(v)} h_u^{(\ell)} / |N(v)|$: transforms and aggregates the embeddings of $v$'s neighbors
§ $h_v^{(0)} = x_v$: the attributes of node $v$
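To make the update concrete, here is a minimal sketch of one GraphSAGE-style layer with the mean aggregator, written directly from the equation above. It assumes PyTorch; the names (SageLayer, adj, h) are illustrative and not taken from the paper's released code.

```python
# Minimal sketch of one GraphSAGE-style layer (mean aggregator), assuming PyTorch.
# Class/variable names are illustrative, not from the official GraphSAGE code.
import torch
import torch.nn as nn

class SageLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # transforms aggregated neighbors
        self.B = nn.Linear(in_dim, out_dim, bias=False)  # transforms the node's own embedding

    def forward(self, h, adj):
        # h:   [num_nodes, in_dim] embeddings at level l
        # adj: [num_nodes, num_nodes] binary adjacency matrix
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)    # |N(v)|, guarding isolated nodes
        neigh_mean = (adj @ h) / deg                       # (1/|N(v)|) * sum_{u in N(v)} h_u
        return torch.relu(self.B(h) + self.W(neigh_mean))  # sigma(B h_v + W * neighbor mean)

# Usage: two layers on a toy 4-node graph; h^(0) is the node attribute matrix.
adj = torch.tensor([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
x = torch.randn(4, 8)
layer1, layer2 = SageLayer(8, 16), SageLayer(16, 16)
z = layer2(layer1(x, adj), adj)  # 2-hop embeddings, one row per node
```

Because the parameters W and B are shared across all nodes, the same layer runs on any graph, which is what makes the model size independent of the number of nodes.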
[NIPS '17] GraphSAGE: Training
§ The aggregation parameters $W^{(\ell)}, B^{(\ell)}$ are shared across all nodes: the computation graphs for node A and node B in the input graph use the same parameters
§ The number of model parameters is therefore sublinear in $|V|$
§ Can use different loss functions:
  § Classification/regression: $\mathcal{L}(h_v) = \lVert y_v - f(h_v) \rVert^2$
  § Pairwise loss: $\mathcal{L}(h_u, h_v) = \max\big(0,\ 1 - \mathrm{dist}(h_u, h_v)\big)$

[NeurIPS '18] Pooling for GNNs
Don't just embed individual nodes; embed the entire graph.
Problem: learn how to hierarchically pool the nodes so as to embed the entire graph.
Our solution: DIFFPOOL
§ Learns a hierarchical pooling strategy
§ Sets of nodes are pooled hierarchically
§ Soft assignment of nodes to next-level nodes
Hierarchical Graph Representation Learning with Differentiable Pooling. R. Ying, et al. NeurIPS 2018.

DIFFPOOL Architecture
A different GNN is learned at every level of abstraction. Our approach uses two sets of GNNs:
§ GNN1 learns how to pool the network, i.e., the cluster assignment matrix
§ GNN2 learns the node embeddings
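To illustrate one coarsening step, the sketch below builds the pooled graph from a soft cluster assignment matrix S and node embeddings Z via X' = Sᵀ Z and A' = Sᵀ A S. In DIFFPOOL, S comes from GNN1 and Z from GNN2; here both are toy tensors, and all names are illustrative.

```python
# Sketch of one DIFFPOOL coarsening step, assuming PyTorch.
# In the real model, S is produced by GNN1 and Z by GNN2; here they are toy tensors.
import torch

n, m, d = 6, 2, 4                            # nodes, clusters, embedding dimension
A = (torch.rand(n, n) > 0.5).float()         # toy adjacency standing in for a real graph
A = torch.triu(A, 1); A = A + A.T            # symmetrize, drop self-loops
Z = torch.randn(n, d)                        # node embeddings (role of GNN2's output)
S = torch.softmax(torch.randn(n, m), dim=1)  # soft assignments (role of GNN1's output)

X_pooled = S.T @ Z      # [m, d]: embeddings of the next-level (cluster) nodes
A_pooled = S.T @ A @ S  # [m, m]: weighted adjacency between clusters
# The next level of the hierarchy runs a GNN on (A_pooled, X_pooled).
```

Because S is a soft, differentiable assignment, the whole hierarchy can be trained end to end.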
Next: many successful extensions and domain adaptations.

Application (1): Pinterest
Task: recommend related pins. Learn node embeddings $z_i$ such that $d(z_{\mathrm{cake}_1}, z_{\mathrm{cake}_2}) < d(z_{\mathrm{cake}_1}, z_{\mathrm{sweater}})$.
§ Challenges:
  § Massive size: 3 billion nodes, 20 billion edges
  § Heterogeneous data: rich image and text features
[Bar chart: Mean Reciprocal Rank of related-pin recommendations for visual, annotation, and GCN-based methods.]
Graph Convolutional Neural Networks for Web-Scale Recommender Systems. R. Ying, et al. KDD 2018.

Application (2): Drug Side Effects
§ Task: given a pair of drugs, predict adverse side effects (46% of people ages 70-79 take more than 5 drugs)
§ Link prediction on a multimodal graph of proteins and drugs
§ 36% improvement in AP@50 over the state of the art
Modeling Polypharmacy Side Effects with Graph Convolutional Networks. M. Zitnik, et al. Bioinformatics, 2018.

[NeurIPS '18] Application (3): Targeted Molecule Generation
Goal: generate molecules that optimize a given property (quantum energy, solubility).
Solution: a combination of
§ Graph representation learning
§ Adversarial training
§ Reinforcement learning
Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. J. You, et al. NeurIPS 2018.

[NeurIPS '18] Application (4): Knowledge Graphs
Embedding Logical Queries on Knowledge Graphs. W. Hamilton, et al. NeurIPS 2018.

How Expressive are GNNs?
Theoretical framework: characterize a GNN's discriminative power by analyzing the discriminative power of its multiset aggregation function:
§ Characterize an upper bound on the discriminative power of GNNs
§ Propose a maximally powerful GNN
§ Characterize the discriminative power of popular GNNs that use non-injective multiset functions
How Powerful are Graph Neural Networks? K. Xu, et al. ICLR 2019.

Discriminative Power of GNNs
Recall the layer form $\gamma\big(\bigoplus_{u \in N(v)} h_u\big)$, where $\gamma(\cdot)$ combines and $\bigoplus(\cdot)$ aggregates.
Theorem: the most discriminative GNN uses an injective multiset function for neighbor aggregation. If the aggregation function is injective, the GNN can fully capture/distinguish the rooted subtree structures.
Theorem: GNNs can be at most as powerful as the Weisfeiler-Lehman graph isomorphism test (a.k.a. canonical labeling or color refinement).
Which GNN achieves this upper bound?

Injective Functions over Multisets
Idea: if the GNN's functions are injective, the GNN can capture/distinguish the rooted subtree structures.
Theorem: any injective multiset function can be expressed as $\phi\big(\sum_{x \in S} f(x)\big)$.
§ Solution: model $\phi(\cdot)$ and $f(\cdot)$ with MLPs
Consequence: the sum aggregator is the most expressive.
How Powerful are Graph Neural Networks? K. Xu, et al. ICLR 2019.

Expressive Power of GNNs
Many popular aggregators are not injective:
§ Mean aggregation, $\mathrm{MEAN}(\{f(x) : x \in S\})$, cannot distinguish multisets that have the same distribution of elements but different multiplicities
§ Max aggregation, $\mathrm{MAX}(\{f(x) : x \in S\})$, cannot distinguish multisets with the same underlying set, because it ignores multiplicities
Theorem (informal), ranking the aggregators by expressive power: sum captures the full multiset; mean captures only the proportion/distribution of element types; max reduces the multiset to a simple set.
[Figure 2: ranking by expressive power for the sum, mean, and max-pooling aggregators over an input multiset.]
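A tiny numeric check of this ranking, taking f to be the identity: the two multisets below differ only in element multiplicities, so sum separates them while mean and max collapse them.

```python
# Sum vs. mean vs. max over two multisets with the same distribution of elements.
import torch

S1 = torch.tensor([[1.0], [2.0]])                # multiset {1, 2}
S2 = torch.tensor([[1.0], [1.0], [2.0], [2.0]])  # multiset {1, 1, 2, 2}

for name, agg in [("sum",  lambda s: s.sum(0)),
                  ("mean", lambda s: s.mean(0)),
                  ("max",  lambda s: s.max(0).values)]:
    print(f"{name}: {agg(S1).item()} vs {agg(S2).item()}")
# sum:  3.0 vs 6.0  -> distinguishes the two multisets
# mean: 1.5 vs 1.5  -> same distribution, indistinguishable
# max:  2.0 vs 2.0  -> same underlying set, indistinguishable
```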
Failure Cases for Mean and Max Aggregation
[Figure 3: examples of simple graph structures that the mean and max-pooling aggregators fail to distinguish: (a) mean and max both fail; (b) max fails; (c) mean and max both fail. Figure 2 gives the reasoning about how the different aggregators "compress" different graph structures/multisets.]

In addition, many existing GNNs use a 1-layer perceptron (Duvenaud et al., 2015; Kipf & Welling, 2017; Zhang et al., 2018), i.e., a linear mapping followed by a non-linear activation function such as a ReLU, instead of a full MLP. Such 1-layer mappings are instances of Generalized Linear Models (Nelder & Wedderburn, 1972), so it is natural to ask whether they are enough for graph learning. They are not: there are network neighborhoods (multisets) that models with 1-layer perceptrons can never distinguish (Lemma 7 of Xu et al., ICLR 2019).
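Putting the two ingredients together (sum aggregation plus an MLP rather than a 1-layer perceptron) yields a maximally powerful layer. The sketch below follows the update $h_v^{(\ell+1)} = \mathrm{MLP}\big((1+\epsilon)\,h_v^{(\ell)} + \sum_{u \in N(v)} h_u^{(\ell)}\big)$ proposed in the ICLR 2019 paper (GIN); it is an illustrative re-implementation in PyTorch, not the authors' released code.

```python
# Sketch of a GIN-style layer: injective sum aggregation followed by an MLP.
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.eps = nn.Parameter(torch.zeros(1))  # learnable weight on the node's own embedding
        self.mlp = nn.Sequential(                # a true MLP, not a 1-layer perceptron
            nn.Linear(in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, h, adj):
        # Sum (not mean/max) over neighbors preserves the full multiset information.
        return self.mlp((1 + self.eps) * h + adj @ h)

adj = torch.tensor([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=torch.float)
h = torch.randn(3, 8)
print(GINLayer(8, 16)(h, adj).shape)  # torch.Size([3, 16])
```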
Summary
Graph convolutional networks:
§ Generalize beyond simple convolutions
§ Fuse node features and graph information
§ Achieve state-of-the-art accuracy for node classification and link prediction
§ Have a model size independent of graph size, so they can scale to billions of nodes
§ Largest embedding to date: 3B nodes, 17B edges
§ Lead to significant performance gains

Conclusion
Results from the past 2-3 years have shown that:
§ The representation learning paradigm can be extended to graphs
§ No feature engineering is necessary
§ Node attribute data can be effectively combined with the network information
§ State-of-the-art results are obtained in a number of domains/tasks
§ End-to-end training gives better performance than multi-stage approaches

Future Work
§ Understand other/new GNN variants
§ Understand optimization and generalization of GNNs
§ Towards more powerful deep learning on graphs: GNNs can only distinguish rooted trees

References
§ Tutorial on Representation Learning on Networks, WWW 2018. http://snap.stanford.edu/proj/embeddings-www/
§ Inductive Representation Learning on Large Graphs. W. Hamilton, R. Ying, J. Leskovec. NIPS 2017.
§ Representation Learning on Graphs: Methods and Applications. W. Hamilton, R. Ying, J. Leskovec. IEEE Data Engineering Bulletin, 2017.
§ Graph Convolutional Neural Networks for Web-Scale Recommender Systems. R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec. KDD 2018.
§ Modeling Polypharmacy Side Effects with Graph Convolutional Networks. M. Zitnik, M. Agrawal, J. Leskovec. Bioinformatics, 2018.
§ Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation. J. You, B. Liu, R. Ying, V. Pande, J. Leskovec. NeurIPS 2018.
§ Embedding Logical Queries on Knowledge Graphs. W. Hamilton, P. Bajaj, M. Zitnik, D. Jurafsky, J. Leskovec. NeurIPS 2018.
§ How Powerful are Graph Neural Networks? K. Xu, W. Hu, J. Leskovec, S. Jegelka. ICLR 2019.

Code
§ http://snap.stanford.edu/graphsage
§ http://snap.stanford.edu/decagon/
§ https://github.com/bowenliu16/rl_graph_generation
§ https://github.com/williamleif/graphqembed
§ https://github.com/snap-stanford/GraphRNN

Acknowledgments
§ PhD students: Alexandra Porter, Camilo Ruiz, Claire Donnat, Emma Pierson, Jiaxuan You, Mitchell Gordon, Mohit Tiwari, Rex Ying
§ Post-doctoral fellows and research staff: Baharan Mirzasoleiman, Marinka Zitnik, Michele Catasta, Robert Palovics, Srijan Kumar, Adrijan Bradaschia, Rok Sosic
§ Collaborators: Dan Jurafsky (Linguistics, Stanford); Cristian Danescu-Niculescu-Mizil (Information Science, Cornell); Stephen Boyd (Electrical Engineering, Stanford); David Gleich (Computer Science, Purdue); V.S. Subrahmanian (Computer Science, University of Maryland); Sarah Kunz (Medicine, Harvard); Russ Altman (Medicine, Stanford); Jochen Profit (Medicine, Stanford); Eric Horvitz (Microsoft Research); Jon Kleinberg (Computer Science, Cornell); Sendhil Mullainathan (Economics, Harvard); Scott Delp (Bioengineering, Stanford); Jens Ludwig (Harris Public Policy, University of Chicago)
§ Thanks also to our industry partners and funding agencies.