JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 28, Number 12, 2021
© Mary Ann Liebert, Inc.
Pp. 1196–1207
DOI: 10.1089/cmb.2021.0069

GoVec: Gene Ontology Representation Learning Using Weighted Heterogeneous Graph and Meta-Path

Downloaded by Copenhagen University Library from www.liebertpub.com at 12/31/21. For personal use only.

ESMAEIL NOURANI 1,*,i

ABSTRACT

Biomedical knowledge graphs are crucial to support data-intensive applications in the life sciences and health care. These graphs can be extended by generating a heterogeneous graph that contains both ontology terms and biomedical entities. However, state-of-the-art approaches for Gene Ontology representation learning are constrained to homogeneous graphs that cannot represent different node types and relations. To address this limitation, we present GoVec, which produces representations seamlessly for both ontologies and biological entities by utilizing meta-path-based representation learning in the heterogeneous graph. The resulting vectors can be used in many bioinformatics applications, particularly for calculating semantic similarity and extracting relations among biological entities. We verify the usefulness of the approach by comparing the resulting semantic similarities with similarities produced manually by experts. Furthermore, the superiority of GoVec is shown by an extensive set of quantitative and qualitative evaluations. Two downstream tasks, protein–protein interaction and protein family similarity, are evaluated in comparison with many state-of-the-art approaches. Finally, as a qualitative visual representation, the separability of various protein families is examined and visually separable groups of proteins are generated, which shows the capability of GoVec representations to embed functional semantics into the vectors.

Keywords: Gene Ontology, heterogeneous graph, meta-path, representation learning.

1. INTRODUCTION

Biological knowledge captures various aspects of biological phenomena.
Nowadays, this knowledge is represented in formal and structured biomedical ontologies, thanks to many years of research (Bodenreider, 2008). Ontology annotations can be used for categorizing activities and associations of biological entities (Smith et al., 2007). Annotation is performed by associating specific terms of the ontology, along with evidence meta-data, with biological entities such as genes and gene products (Hill et al., 2008; Smaili et al., 2018).

1 Department of Information Technology, Faculty of Computer Engineering and Information Technology, Azarbaijan Shahid Madani University, Tabriz, Iran.
* Current address: Novo Nordisk Foundation Center for Protein Research, The Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
i ORCID ID (https://orcid.org/0000-0003-1933-2550).

Specifically, Gene Ontology (GO) annotates gene products employing a set of structured vocabularies (Harris et al., 2004). GO vocabularies consist of three different categories: cellular component (CC), biological process (BP), and molecular function (MF), which are formed as a directed acyclic graph. Nodes of the graph are called terms, and they are related to other terms mostly through a parent–child relationship. The hierarchy in the graph is complex, since every node may have multiple parents and more than one child. Biological knowledge graphs and ontology-based annotations are crucial for data science-based applications in the life sciences and health care. Biological entities in the graph can be compared based on their ontological descriptions. The extracted information can be used for finding similarities between biomedical entities by computing semantic similarity.
Consequently, this knowledge can be used in many applications such as protein–protein interaction (PPI) prediction and discovering associations between diseases and genes. GO (Ashburner et al., 2000), as the most widely used knowledge graph, has attracted tremendous attention in bioinformatics applications. Semantic similarity computation between GO terms has attracted considerable research attention. Most of the pioneering studies estimate the similarity based on information content (IC) (Lin, 1998; Resnik, 1999). In addition, statistics about the common ancestors in the graph are utilized in many studies (Lord et al., 2003; Sevilla et al., 2005; Yang et al., 2012; Teng et al., 2013; Song et al., 2014). The resulting semantic similarity can be used as a feature for computational approaches in many applications such as drug repositioning (Gottlieb et al., 2011) or extracting genomic variants (Robinson et al., 2014; Boudellioua et al., 2017). What semantic similarity-based methods have in common is that the generated features cannot include the ontology structure and, consequently, this information is not available to the machine learning method.

Besides the semantic similarity-based approaches developed over the past two decades, ontology-based annotations are also used directly in bioinformatics studies. In this approach, one-hot binary vectors are generated that represent whether an entity is annotated with a term or not (Sokolov et al., 2013). In this case, the ontology is used to generate the binary feature vectors, but its structure is not available, implicitly or explicitly, to the machine learning approach. Therefore, feature vectors that directly encode both the structure and the annotations of entities can significantly outperform the previous approaches. Recently, word embedding-based approaches have been used for this reason (Smaili et al., 2018, 2019; Duong et al., 2019; Zhong et al., 2019), based on similar ideas from natural language processing.
Latent features produced by these approaches implicitly encode the structure of the ontology and consequently are used for calculating the similarity between biological entities.

Nowadays, heterogeneous graph representation learning is at the center of attention (Dareddy et al., 2019; Wang et al., 2019; Zhang et al., 2019; Molaei et al., 2020) because of its ability to overcome the constraints of traditional network embedding methods by supporting multiple types of nodes and links among them. Embedding-based approaches for ontology graphs can be significantly extended to contain a diverse set of entities and ontologies by utilizing the recently introduced heterogeneous graph representation learning method called metapath2vec (Dong et al., 2017). We present GoVec to jointly generate representation vectors for ontology terms and biological entities based on a weighted heterogeneous graph and meta-paths (Dong et al., 2017). GoVec can be significantly extended in future studies to support more ontologies and entities. However, in this study, we only demonstrate the applicability of heterogeneous graphs for generating richer and more flexible representations. As a case study, we produce vectors for GO terms along with proteins, but the approach can be easily extended to contain more entities and their relations.

2. METHODS

In this study, we present a novel representation learning method called GoVec for heterogeneous knowledge graphs, specifically for GO. Table 1 shows the statistics of the employed GO. Recent state-of-the-art approaches for GO representation learning (Smaili et al., 2018; Zhong et al., 2019) suffer from serious limitations. The first issue is transforming the directed GO graph into an undirected graph, which might lead to the loss of structural information. The second problem is treating different relations as the same kind of edge in the graph.
For instance, relations among terms differ from term–protein annotations, but both are treated equally as regular edges.

Table 1. Statistics of the Used Gene Ontology
Category | Node/edge | Count
BP | Node | 29,457
MF | Node | 11,093
CC | Node | 4183
Total | Node | 44,733
Is_a | Edge | 75,161
Negatively_regulates | Edge | 2822
Part_of | Edge | 7544
Positively_regulates | Edge | 2799
BP, biological process; CC, cellular component; MF, molecular function.

To address these challenges, we utilize heterogeneous network embedding based on meta-paths, which allows the approach to distinguish between various node types and edge types. We emphasize the usefulness of having various node types in the graph by seamlessly considering different ontology types and biological entities. As a simple case study, we only consider proteins as the biological entity, along with different types of GO terms and the various relations among them, to form a heterogeneous graph. A similar approach can be easily applied to other biomedical terms such as chemicals, diseases, and phenotypes.

In this study, we utilize three categories of source data to generate the heterogeneous graph. The first and most important input is the GO graph containing different term types and the relations among them. In this first step, every edge of the base GO graph receives the maximum edge weight, which is 1. Figure 1 shows the ancestor graph for the term GO:0022857 (transmembrane transporter activity). Most of the relations in the graph are parent–child relations, called "is-a," but there are other relation types such as "part-of," which can connect different categories of the GO graph. As shown in Figure 1, the base leaf term, which is of type MF, is connected to a BP term using the "part-of" relation.

FIG. 1.
GO graph ancestor chart for GO:0022857 (adapted from https://www.ebi.ac.uk/QuickGO/term/GO:0022857). GO, Gene Ontology.

The second input for generating the graph is the list of co-occurrences extracted from QuickGO (Huntley et al., 2009, 2015). For each term, we have extracted the list of GO terms that are commonly co-annotated to a gene product. Between every pair of co-occurring terms, we add an edge to the GO graph of the first step. It is important to note that for every term there is a list of terms sorted by a score that shows the significance of co-occurrence. For each compared term, this overlap significance is computed as the ratio of the number of proteins annotated by both terms to the number of all proteins annotated by at least one of them. We use the overlap significance score as the weight for the edge between the co-occurring terms in the graph. This score is a value between 0 and 100, which is scaled to the range 0 to 1 when used as the edge weight.

The third input to the graph is the list of annotated proteins for each of the GO terms. There is an edge between every annotated protein and the GO term. Weights of these edges are normalized based on the number of terms for every protein. That is to say, if 5 terms annotate a protein, dividing 1 by the number of annotating terms gives an edge weight of 0.2 for each of the 5 generated edges. It is important to note that the node types for proteins and GO terms are not the same, and this is taken into account when defining paths for metapath2vec.

Figure 2 illustrates details of the proposed approach to generate representations for GO terms and proteins. Since an integrated heterogeneous graph is used as the input to the embedding method, representations are generated jointly for different node types, including GO terms and proteins.
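The three weighting rules described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the author's implementation: the function name `build_graph`, the dictionary-based graph layout, the toy identifiers, the symmetric storage of edges, and the choice to keep an existing base GO edge when a co-occurrence edge joins the same pair of terms are all assumptions made here.

```python
from collections import defaultdict


def build_graph(go_edges, cooccurrence, annotations):
    """Build a weighted heterogeneous graph as plain dictionaries.

    go_edges:     iterable of (term, related_term) pairs from the base GO graph
    cooccurrence: iterable of (term_a, term_b, score) with score in 0..100
    annotations:  dict mapping a protein to the list of GO terms annotating it
    Returns (edges, node_type) where edges[u][v] is the edge weight.
    """
    edges = defaultdict(dict)
    node_type = {}

    def add_edge(u, v, weight):
        # Assumption: edges are stored symmetrically and an existing
        # weight is never overwritten by a later step.
        edges[u].setdefault(v, weight)
        edges[v].setdefault(u, weight)

    # Step 1: every base GO relation gets the maximum weight, 1.
    for term, related in go_edges:
        node_type[term] = node_type[related] = "go"
        add_edge(term, related, 1.0)

    # Step 2: co-occurrence edges; the 0-100 score is scaled to 0-1.
    for term_a, term_b, score in cooccurrence:
        node_type[term_a] = node_type[term_b] = "go"
        add_edge(term_a, term_b, score / 100.0)

    # Step 3: protein-term edges weighted by 1 / (number of annotating terms).
    for protein, terms in annotations.items():
        node_type[protein] = "protein"
        for term in terms:
            node_type[term] = "go"
            add_edge(protein, term, 1.0 / len(terms))

    return dict(edges), node_type


# Toy input with hypothetical identifiers (not real GO terms or proteins).
edges, node_type = build_graph(
    go_edges=[("GO:A", "GO:B")],
    cooccurrence=[("GO:A", "GO:C", 40)],
    annotations={"P1": ["GO:A", "GO:B", "GO:C", "GO:D", "GO:E"]},
)
print(edges["P1"]["GO:A"], edges["GO:A"]["GO:C"], edges["GO:A"]["GO:B"])
```

With five annotating terms, each protein–term edge receives weight 0.2, matching the worked example in the text. Keeping a `node_type` map next to the edges is what later allows a walker to restrict each step to a given node type.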
2.1. Using meta-path-based random walks to generate representations

To produce representations we use the metapath2vec approach (Dong et al., 2017). The advantage of this approach in comparison with traditional word2vec-based (Mikolov et al., 2013) approaches such as node2vec (Grover and Leskovec, 2016), LINE (Tang et al., 2015), and DeepWalk (Perozzi et al., 2014) is the ability to consider multiple types of nodes and links, which is the case we have in this study and can be easily extended to other domains. Conventional network embedding approaches are constrained to a homogeneous graph that has a single node type and a single edge type. However, most real-world applications require more flexibility to capture various entity and relation types. metapath2vec proposes a heterogeneous version of the skip-gram model. Skip-gram is employed in all of the word-embedding-based approaches for generating representations using random walks. In the heterogeneous meta-path-based random walks, there is no bias toward highly visible types of nodes. Meta-paths generate paths that are capable of extracting both semantic and structural correlations among various types of nodes.

FIG. 2. The workflow of the presented GoVec method for representation learning.

2.1.1. Heterogeneous network representation learning. A heterogeneous network is defined as a graph $G = (V, E, T)$, where each node $v$ and each edge $e$ may have different types based on the mapping functions $\phi(v): V \to T_V$ and $\varphi(e): E \to T_E$, respectively. $T_V$ and $T_E$ are the sets of node types and edge types, where $|T_V| + |T_E| > 2$. In this study, GO terms and proteins are two different node types, but the set of types can be easily extended to other biomedical entities. Given a heterogeneous network $G$, metapath2vec produces $d$-dimensional vector representations $X \in \mathbb{R}^{|V| \times d}$, $d \ll |V|$, which capture structural and semantic relations, where the representation of each node $v$ is a $d$-dimensional vector $X_v$ in a unified latent space.
Metapath2vec enables the skip-gram model to produce effective representations for a heterogeneous network by maximizing the probability of having the heterogeneous context $N_t(v)$, $t \in T_V$, given a node $v$:

$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta), \tag{1}$$

where $N_t(v)$ is the neighborhood of node $v$ with the $t$th node type and $p(c_t \mid v; \theta)$ is defined as a softmax function:

$$p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u \in V} e^{X_u \cdot X_v}},$$

where $X_v$ is the $v$th row of $X$, which is the embedding vector of node $v$.

A meta-path scheme is presented as a path of the form

$$\rho = V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l,$$

which ends at the same node type it started from, the type for which a representation is to be generated. $R$ is the relation between node types $V$. As a concrete example, if we consider three node types, including GO terms, protein domains, and proteins, we can design a meta-path as follows:

$$\text{Protein} \xrightarrow{\text{Annotated By}} \text{GO Term} \xrightarrow{\text{IS-A}} \text{GO Term} \xrightarrow{\text{Annotates}} \text{Protein} \xrightarrow{\text{Contains}} \text{Domain} \xrightarrow{\text{Appear-in}} \text{Protein}.$$

Previous studies have confirmed the usefulness of meta-paths in heterogeneous networks (Sun and Han, 2012; Sun et al., 2012; Dong et al., 2015). As shown in Figure 2, we have defined various meta-paths that cover all node types in the graph, including GO terms and proteins. For example, the last meta-path in Figure 2 defines how two proteins can be related to each other based on a common GO term. Equation 2 shows how meta-paths guide heterogeneous random walkers by modifying the transition probability $p$ from the node denoted by $v_t^i$, where $i$ is the step number and $t$ is the node type:

$$p(v^{i+1} \mid v_t^i, \rho) = \begin{cases} \dfrac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) = t+1 \\ 0 & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) \neq t+1 \\ 0 & (v^{i+1}, v_t^i) \notin E \end{cases} \tag{2}$$

In Equation 2, $v_t^i \in V_t$ and $N_{t+1}(v_t^i)$ is the $V_{t+1}$-type neighborhood of the node $v_t^i$. That is to say, random walks are based on predefined meta-paths.
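The transition rule of Equation 2 can be illustrated with a small meta-path-guided walker. The sketch below is an assumption-laden illustration rather than the GoVec code: the function name `metapath_walk`, the adjacency-list layout, and the toy node names are hypothetical, and the walker follows the uniform rule of Equation 2 without using edge weights, since the text does not specify how the weights enter the transition probability.

```python
import random


def metapath_walk(neighbors, node_type, start, metapath, walk_length, rng=random):
    """Generate one random walk that follows the node types in `metapath`.

    neighbors: dict mapping a node to a list of adjacent nodes
    node_type: dict mapping a node to its type label
    metapath:  list of type labels whose first and last entries are equal,
               e.g. ["protein", "go", "protein"], so the scheme can repeat
    """
    if metapath[0] != metapath[-1] or node_type[start] != metapath[0]:
        raise ValueError("meta-path must be symmetric and match the start node")
    cycle = metapath[:-1]  # drop the repeated last type so the scheme can loop
    walk = [start]
    while len(walk) < walk_length:
        wanted = cycle[len(walk) % len(cycle)]
        # Equation 2: neighbors of any other type get probability 0 ...
        candidates = [n for n in neighbors.get(walk[-1], ()) if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        # ... and each neighbor of the required type gets 1 / |N_{t+1}(v)|.
        walk.append(rng.choice(candidates))
    return walk


# Toy graph with hypothetical identifiers: two proteins sharing one GO term.
neighbors = {"P1": ["GO:A"], "P2": ["GO:A"], "GO:A": ["P1", "P2"]}
node_type = {"P1": "protein", "P2": "protein", "GO:A": "go"}
walk = metapath_walk(neighbors, node_type, "P1",
                     ["protein", "go", "protein"], walk_length=5,
                     rng=random.Random(0))
print(walk)
```

Walks produced this way play the role of sentences for a skip-gram model; the types of the visited nodes always follow the chosen scheme, here protein, GO term, protein, and so on.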
Meta-path-based random walks properly feed the semantic connections among various node types to the skip-gram model, where the final representation is generated. Table 2 shows the parameters used for the word2vec model to generate representations. To compute the semantic similarity between the vectors of two proteins, we use cosine similarity. The scores produced using GoVec representations are compared with the precomputed semantic similarity measures quoted from the benchmark data sets for every protein pair.

Table 2. Used Parameters for Training the Model
Parameter name | Definition | GoVec value
Sg | Selecting the training algorithm between skip-gram (sg = 1) and CBOW (sg = 0) | 1
Length | The maximum length of a random walk | 50
N | Number of random walks per root node | 20
D | Embedding size | 128
Window | Context window size | 5
CBOW, continuous bag of words model.

3. RESULTS AND DISCUSSION

To evaluate the presented approach, we have conducted an extensive set of experiments, from comparing the results with ontologies analyzed manually by experts to evaluating the method over benchmark data sets and showing the superiority of GoVec in comparison with other approaches. Previous studies mostly evaluated their methods in terms of semantic similarity using the collaborative evaluation of semantic similarity measures (CESSM) webtool (Pesquita et al., 2009). Unfortunately, this benchmark server is not currently available, so in this study we utilize a comprehensive and more recently introduced benchmark (Cardoso et al., 2020), which is designed for knowledge graph-based similarity in the biomedical domain. This benchmark covers different species; in this study, human data are considered in all experiments. The benchmark introduces reference proxy measures for similarity calculated based on protein MF, and PPIs.
Furthermore, the benchmark provides semantic similarity computations with state-of-the-art representative measures, for a comparative evaluation of the measures. The benchmark data set provides two sets of protein pairs for every candidate species and utilizes both breadth and depth of class annotations. In the first set (one aspect), proteins of every pair should be annotated in at least one of the GO aspects, and there must be at least one leaf term among the annotating terms. In the second set (all aspects), proteins of every pair should have at least one annotation in each GO aspect, and within the annotating terms, there should be at least one leaf term of every aspect. Table 3 summarizes the statistics of the employed data sets from the benchmark, which are used for evaluating the performance of GoVec in the following sections.

3.1. Semantic similarity calculation methods

GoVec is compared with four representative semantic similarity measures (SSMs) (Cardoso et al., 2020). Each of these measures is designed using a combination of two methods: a method to compute the IC of an annotating class (GO term; ICSeco or ICResnik) and an IC-based method to compute the similarity between the biomedical entities (simGIC or Best Match Average [BMA]). BMA and simGIC, as IC-based entity similarity methods, are high-performance classical measures for semantic similarity, which are widely accepted in the research community, unlike the new structure-based measures. ICSeco, introduced by Seco et al. (2004), is computed according to the number of children of class $c$, denoted by $h(c)$:

$$IC_{Seco}(c) = 1 - \frac{\log(h(c) + 1)}{\log(N)}, \tag{3}$$

where $N$ is the total number of classes in the ontology. ICResnik is presented by Resnik (1999) according to the entities annotated using class $c$ in the GO graph:

$$IC_{Resnik}(c) = -\log p(c), \tag{4}$$

where $p(c)$ is the probability of annotation in the corpus.
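Equations 3 and 4 translate directly into code. The following sketch is for illustration only and makes two assumptions that the text does not state: $p(c)$ is estimated as the fraction of annotated entities that carry class $c$, and natural logarithms are used (the ratio in Equation 3 does not depend on the base). The function names are hypothetical.

```python
import math


def ic_seco(num_children, total_classes):
    """Equation 3: IC_Seco(c) = 1 - log(h(c) + 1) / log(N)."""
    return 1.0 - math.log(num_children + 1) / math.log(total_classes)


def ic_resnik(annotated_with_c, total_annotated):
    """Equation 4: IC_Resnik(c) = -log p(c).

    Assumption: p(c) is estimated as the share of annotated entities
    that are annotated with class c.
    """
    return -math.log(annotated_with_c / total_annotated)


# A class without children is maximally informative under IC_Seco.
# 44733 is the total node count from Table 1, used here only as an example N.
print(ic_seco(0, 44733))
# A class annotating 1 in 1000 entities is more informative than one annotating 1 in 10.
print(ic_resnik(1, 1000) > ic_resnik(100, 1000))
```

Both measures grow as a class becomes more specific: ICSeco through having fewer children, ICResnik through annotating fewer entities.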
BMA is a pairwise measure in which the similarity of two classes is computed based on the common ancestor of a pair (Resnik, 1999). For every class, BMA considers the most similar class:

$$BMA(A, B) = \frac{\sum_{c_1 \in C_A} sim(c_1, c_2)}{2|C_A|} + \frac{\sum_{c_2 \in C_B} sim(c_2, c_1)}{2|C_B|}, \tag{5}$$

where $A$ and $B$ are entities (i.e., proteins), $C_A$ and $C_B$ are the sets of classes $c$ that annotate each of the entities, and $sim(c_1, c_2)$ returns the highest similarity value for the classes. The similarity between two classes is computed by Resnik similarity:

$$sim(c_1, c_2) = \max\{IC(a) : a \in A(c_1) \cap A(c_2)\}, \tag{6}$$

where $a$ is a class in the ancestors' set of $c_i$ [i.e., $A(c_i)$]. SimGIC (Pesquita et al., 2008) is similar to Jaccard similarity, with every class weighted by its IC:

$$SimGIC(A, B) = \frac{\sum_{c \in C_A \cap C_B} IC(c)}{\sum_{c \in C_A \cup C_B} IC(c)}. \tag{7}$$

Table 3. Number of Protein Pairs in the Two Data Sets of the Benchmark for Each of the Reference Proxy Measures
Proxy measure | One aspect/all aspects | Number of protein pairs
Protein family | One aspect | 31,350
Protein family | All aspects | 25,527
PPI | One aspect | 30,826
PPI | All aspects | 29,672
PPI, protein–protein interaction.

To sum up, four state-of-the-art methods to compute the semantic similarity are utilized in the benchmark: BMAResnik, BMASeco, simGICResnik, and simGICSeco, obtained by combining the IC computation methods (ICSeco or ICResnik) with the entity similarity computation methods (simGIC or BMA). In the following sections, the results of comparing GoVec with these four methods are discussed. Semantic similarity results of these methods are directly extracted from the benchmark (Cardoso et al., 2020) for every protein pair of the data sets, whereas the semantic similarity of GoVec is produced using cosine similarity between two protein vectors.

3.2. Protein family similarity

The first proxy measure for evaluating other approaches is MF similarity, which is calculated by comparing the common domains of the two candidate protein sequences. Functional domains of proteins are gathered from the Pfam database (El-Gebali et al., 2019). Pfam similarity (SimPfam) is calculated using Jaccard similarity, based on the number of common families:

$$Sim_{Pfam}(A, B) = \frac{|f_A \cap f_B|}{|f_A \cup f_B|}, \tag{8}$$

where $f_A$ and $f_B$ are the sets of families for the two proteins. If they share more functional domains, the SSM, especially in the MF aspect, will be higher. Based on the results in Table 4, GoVec outperforms the other approaches.

3.3. PPI similarity

The second proxy measure is designed based on PPI. In this measure, two proteins are considered to be similar if they interact. Correlation between PPI and GO-based semantic similarity is reasonable when two proteins share the same cellular location and the same BP. In this case, they are more likely to interact and to share common CC and BP terms. This is verified in some studies, which gain accuracy in PPI prediction by using terms of these two aspects (Sousa et al., 2020). However, noninteracting proteins can be considered similar based on different metrics. That is why we consider both PPI and MF similarity as complementary proxy measures. Table 5 shows the results of different semantic similarity calculation methods.

3.4. Comparison with word2vec-based approaches

The advantage of GoVec in comparison with traditional word2vec-based approaches such as node2vec and Onto2vec (Smaili et al., 2018) is the ability to consider multiple types of nodes and links among them. Onto2vec is a word2vec-based approach that generates representations for GO terms, and accordingly for proteins, based on the GO knowledge graph. To compare the performance of GoVec with other word2vec-based methods, we add protein domains as another entity type to the knowledge graph.
For this reason, we utilize the InterPro database (Hunter et al., 2009) and add protein–domain associations to the graph. InterPro release 85.0 is used in this study, and the statistics of the release are summarized in Table 6. Besides adding protein–domain associations, we add another relation to the graph between GO terms and InterPro domains, which is based on mappings extracted from the InterPro database. Therefore, the resulting heterogeneous graph contains three types of entities, including proteins, GO terms, and protein domains, and the associations between them. As a result, we can produce more insightful meta-paths in the graph and generate representations for proteins based on both associated GO terms and InterPro domains. As a side benefit, we can now generate representations for InterPro domains in addition to representations for GO terms and proteins.

Table 4. Pearson Correlation Coefficient Between the Pfam Similarity and Semantic Similarity Measures Computed by the Methods
Method | PCC (one aspect) | PCC (all aspects)
BMAResnik | 0.692 | 0.702
BMASeco | 0.733 | 0.743
simGICResnik | 0.647 | 0.659
simGICSeco | 0.663 | 0.674
GoVec | 0.753* | 0.761*
* Best result in each column (shown in boldface in the original). BMA, Best Match Average; PCC, Pearson correlation coefficient.

Table 5. Pearson Correlation Coefficient Between Protein–Protein Interaction (Binary Value) and Semantic Similarity Computed by the Methods for Every Pair of Proteins
Method | PCC (one aspect) | PCC (all aspects)
BMAResnik | 0.489 | 0.493
BMASeco | 0.505 | 0.509
simGICResnik | 0.379 | 0.380
simGICSeco | 0.389 | 0.390
GoVec | 0.521* | 0.526*
* Best result in each column (shown in boldface in the original).
Considering that both Onto2vec and GoVec are word2vec-based approaches, to have a fair comparison in terms of hyperparameters, GO graph, and benchmark data set, we train and evaluate a word2vec model based on Onto2vec's approach but using the same hyperparameters, GO graph, and benchmark data set as GoVec. As shown in Table 7, we provide several versions of GoVec based on the entity types they contain. The first version only contains GO terms and proteins and the corresponding association types, whereas the next version contains three types of entities, including GO terms, proteins, and domains. We also checked the performance of the method without adding co-occurrences of GO terms. Results show that adding domain nodes to the graph improves performance more than adding co-occurring terms. However, adding co-occurrences still improves performance. To show the benefit of defining meta-paths, we consider an identical graph containing all three node types and associations, with exactly the same hyperparameters for the word2vec model, but without meta-paths, and generate representations based only on the approach used by Onto2vec. The results in Table 7 are promising: we achieve the best results with the version of GoVec that utilizes meta-paths and domain information, in comparison with the original word2vec. Furthermore, we provide GoVec results before adding domain nodes, which shows the advantage of utilizing more entity types and relation types in the graph. As expected, integrating domain nodes into the graph improves performance considerably more for the Pfam similarity task than for the PPI task. The performance of Onto2vec on the Pfam similarity task is the lowest among the compared approaches, which may be because domain information is not utilized in this method.
The Pfam similarity task is clearly associated with domain information, so improved results from adding domain types to the graph are to be expected, which underlines the advantage of diverse entity types in the graph. The aim of this experiment is to show the flexibility of the approach to integrate more biomedical entity types and achieve a more comprehensive knowledge graph. As future steps, other valuable node and edge types can be added, such as pathways, with the membership of proteins in pathways considered as edges in the graph. Furthermore, the hierarchy of pathways is another insightful relation and can be added to the graph as a pathway–pathway relation.

Table 6. Statistics of the Employed InterPro Database (Release 85.0 on April 2021)
Entry type | Count
Homologous superfamily | 3285
Family | 22,891
Domain | 11,218
Repeat | 322
Site | 917
Total | 38,633

Table 7. Comparison Between GoVec Versions and Other word2vec-Based Approaches
Method | PCC (Pfam similarity), one aspect | PCC (Pfam similarity), all aspects | PCC (PPI), one aspect | PCC (PPI), all aspects
Onto2Vec (GO + Proteins) | 0.698 | 0.701 | 0.529* | 0.538*
Node2Vec (GO + Proteins + Domains) | 0.778 | 0.781 | 0.489 | 0.493
GoVec (GO + Proteins) | 0.749 | 0.756 | 0.514 | 0.518
GoVec (GO + Proteins + Co-Occurrences) | 0.753 | 0.761 | 0.521 | 0.526
GoVec (GO + Proteins + Domains) | 0.795 | 0.798 | 0.520 | 0.525
GoVec (GO + Proteins + Domains + Co-Occurrences) | 0.808* | 0.811* | 0.523 | 0.529
* Highest value in each column. GO, Gene Ontology.

3.5. Comparison of GoVec with human ratings

In this section, we compare the semantic similarities computed using GoVec representations with the semantic similarities assigned by experts. Rong et al. (2006) selected 25 pairs of GO terms that include various levels of similarity and asked 10 biologists to grade these pairs based on their similarity, ranging from 0 to 10.
The average of the 10 grades for every pair is considered as the reference semantic similarity, which is compared with the semantic similarity values of the methods. In this study, we directly quote the values of the compared methods from Xu et al. (2013), compute similarity grades for these pairs using GoVec representations, and finally calculate the Pearson correlation coefficients (PCCs) between these grades and the human ratings. A higher PCC represents better performance and more similarity to the manual ratings by the biologists. Table 8 shows the results, which verify the superiority of GoVec in comparison with seven referenced methods that are taken from Xu et al. (2013).

3.6. Clustering and visualization

GoVec representations can be used for unsupervised clustering and visualization to recognize biologically related groups. As a case study, the separability of proteins into functional groups is evaluated. In this study, the enzyme proteins employed in Onto2vec for the same visualization purpose are used. First, t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) is used for dimensionality reduction to convert the high-dimensional (128D) GoVec vectors to two-dimensional (2D) points to be used for visualization. t-SNE is similar to principal component analysis (PCA) but can extract nonlinear patterns from the data points using probability distributions, whereas linear methods such as PCA are not able to capture nonlinear structures. To give the separated groups in the visual representation a clearer meaning, similar to Onto2Vec, enzyme commission (EC) numbers are used to assign one of six labels to every protein, based on its top-level EC category. As shown in Figure 3, visually separable groups have the same color (a single EC top-level category), which shows the capability of GoVec representations to embed functional semantics into the vectors.
As a quantitative evaluation of separability based on the top-level EC category, K-means clustering (k = 6) is applied to the GoVec representations and cluster purity is calculated:

$$\mathrm{purity}(T, C) = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |c_i \cap t_j|, \tag{9}$$

where $N$ is the number of proteins, $T$ is the set of computed clusters, and $C$ is the set of clusters formed based on EC classes; there are six clusters since there are six categories based on the first level of EC numbers. The resulting purity is 0.56, in comparison with the purity of 0.42 achieved by Onto2vec, which shows a significant improvement.

FIG. 3. t-SNE visualization of proteins (employed in Onto2vec), using GoVec vectors, based on first-level EC numbers. EC, enzyme commission; t-SNE, t-distributed Stochastic Neighbor Embedding.

Table 8. Pearson Correlation Coefficient Between Human Ratings (for Every Pair) and the Semantic Similarities Computed by the Methods
Method | PCC
Combine (Rong et al., 2006) | 0.864
Lin (Lin, 1998) | 0.865
simUI (Gentleman, 2006) | 0.834
Wang (Wang et al., 2007) | 0.826
Resnik (Resnik, 1999) | 0.824
(Zhong et al., 2002) | 0.714
SSDD (Xu et al., 2013) | 0.894
GoVec | 0.921*
* Best result (shown in boldface in the original). SSDD, shortest semantic differentiation distance.

4. CONCLUSIONS

In this study, we introduced GoVec, a novel meta-path-based representation learning method for the heterogeneous GO graph. We address the limitations of state-of-the-art approaches for GO representation learning, which are constrained to homogeneous graphs that cannot represent different node types and relations. GoVec produces representations seamlessly for both ontologies and biological entities. To evaluate the effectiveness of GoVec vectors, we computed the semantic similarity between vectors of biological entities and compared the results with proxy measures, including protein MF similarity and PPI.
We employed public benchmark data sets that are specifically designed for evaluating semantic similarity computation methods. In the conducted experiments, GoVec consistently outperforms other state-of-the-art methods. Besides the comparison with computational approaches, we compared the produced similarity scores with the similarity values assigned by human experts and found a Pearson correlation above 0.9 between them (PCC = 0.921). Finally, as a qualitative evaluation, the visual separability of various protein families was examined based on top-level EC numbers. The proposed approach can be easily extended to other biomedical knowledge graphs that contain different types of entities and various relation types among them.

AUTHOR DISCLOSURE STATEMENT

The author declares there are no competing financial interests.

FUNDING INFORMATION

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

REFERENCES

Ashburner, M., Ball, C.A., Blake, J.A., et al. 2000. Gene Ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29.
Bodenreider, O. 2008. Biomedical ontologies in action: Role in knowledge management, data integration and decision support. Yearb. Med. Inform. 17, 67–79.
Boudellioua, I., Mahamad Razali, R.B., Kulmanov, M., et al. 2017. Semantic prioritization of novel causative genomic variants. PLoS Comput. Biol. 13, e1005500.
Cardoso, C., Sousa, R.T., Köhler, S., et al. 2020. A collection of benchmark data sets for knowledge graph-based similarity in the biomedical domain. Database 2020, baaa078.
Dareddy, M.R., Das, M., and Yang, H. 2019. motif2vec: Motif aware node representation learning for heterogeneous networks, 1052–1059. In 2019 IEEE International Conference on Big Data (Big Data).
Presented at the 2019 IEEE International Conference on Big Data (Big Data), IEEE, Los Angeles, CA, USA.
Dong, Y., Chawla, N.V., and Swami, A. 2017. metapath2vec: Scalable representation learning for heterogeneous networks, 135–144. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Presented at the KDD’17: The 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Halifax, NS, Canada.
Dong, Y., Zhang, J., Tang, J., et al. 2015. CoupledLP: Link prediction in coupled networks, 199–208. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Presented at the KDD’15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Sydney, NSW, Australia.
Duong, D., Ahmad, W.U., Eskin, E., et al. 2019. Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions. J. Comput. Biol. 26, 38–52.
El-Gebali, S., Mistry, J., Bateman, A., et al. 2019. The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432.
Gentleman, R. 2005–6. Visualizing and distances using GO. http://www.bioconductor.org/docs/vignettes.html (2005).
Gottlieb, A., Stein, G.Y., Ruppin, E., et al. 2011. PREDICT: A method for inferring novel drug indications with application to personalized medicine. Mol. Syst. Biol. 7, 496.
Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks, 855–864. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16. Association for Computing Machinery, New York, NY, USA.
Harris, M.A., Clark, J., Ireland, A., et al. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261.
Hill, D.P., Smith, B., McAndrews-Hill, M.S., et al. 2008. Gene Ontology annotations: What they mean and where they come from. BMC Bioinformatics 9, S2.
Hunter, S., Apweiler, R., Attwood, T.K., et al. 2009. InterPro: The integrative protein signature database. Nucleic Acids Res. 37, D211–D215.
Huntley, R.P., Binns, D., Dimmer, E., et al. 2009. QuickGO: A user tutorial for the web-based Gene Ontology browser. Database 2009, bap010.
Huntley, R.P., Sawford, T., Mutowo-Meullenet, P., et al. 2015. The GOA database: Gene Ontology annotation updates for 2015. Nucleic Acids Res. 43, D1057–D1063.
Lin, D. 1998. An information-theoretic definition of similarity, 296–304. In Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA.
Lord, P.W., Stevens, R.D., Brass, A., et al. 2003. Investigating semantic similarity measures across the Gene Ontology: The relationship between sequence and annotation. Bioinformatics 19, 1275–1283.
Mikolov, T., Sutskever, I., Chen, K., et al. 2013. Distributed representations of words and phrases and their compositionality, 3111–3119. In Burges, C.J.C., Bottou, L., Welling, M., et al., eds. Advances in Neural Information Processing Systems 26. Curran Associates, Inc., Red Hook, NY.
Molaei, S., Zare, H., and Veisi, H. 2020. Deep learning approach on information diffusion in heterogeneous networks. Knowl. Based Syst. 189, 105153.
Perozzi, B., Al-Rfou, R., and Skiena, S. 2014. DeepWalk: Online learning of social representations, 701–710. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’14. Association for Computing Machinery, New York, NY, USA.
Pesquita, C., Faria, D., Bastos, H., et al. 2008. Metrics for GO based protein semantic similarity: A systematic evaluation. BMC Bioinformatics 9, S4.
Pesquita, C., Pessoa, D., Faria, D., et al. 2009. CESSM: Collaborative evaluation of semantic similarity measures. JB2009 Challeng. Bioinform. 157, 190.
Resnik, P. 1999. Using information content to evaluate semantic similarity in a taxonomy, 448–453.
In Proceedings of the 14th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc., Montreal.
Robinson, P.N., Kohler, S., Oellrich, A., et al. 2014. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 24, 340–348.
Rong, L., Shunliang, C., Yuanyuan, L., et al. 2006. A measure of semantic similarity between gene ontology terms based on semantic pathway covering. Progr. Nat. Sci. 16, 721–726.
Seco, N., Veale, T., and Hayes, J. 2004. An intrinsic information content metric for semantic similarity in WordNet, 1089–1090. In Proceedings of the 16th European Conference on Artificial Intelligence, ECAI’04. IOS Press, NLD.
Sevilla, J.L., Segura, V., Podhorski, A., et al. 2005. Correlation between gene expression and GO semantic similarity. IEEE ACM Trans. Comput. Biol. Bioinf. 2, 330–338.
Smaili, F.Z., Gao, X., and Hoehndorf, R. 2018. Onto2Vec: Joint vector-based representation of biological entities and their ontology-based annotations. Bioinformatics 34, i52–i60.
Smaili, F.Z., Gao, X., and Hoehndorf, R. 2019. OPA2Vec: Combining formal and informal content of biomedical ontologies to improve similarity-based prediction. Bioinformatics 35, 2133–2140.
Smith, B., Ashburner, M., Rosse, C., et al. 2007. The OBO foundry: Coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 25, 1251.
Sokolov, A., Funk, C., Graim, K., et al. 2013. Combining heterogeneous data sources for accurate functional annotation of proteins. BMC Bioinformatics 14, S10.
Song, X., Li, L., Srimani, P.K., et al. 2014. Measure the semantic similarity of GO terms using aggregate information content. IEEE ACM Trans. Comput. Biol. Bioinf. 11, 468–476.
Sousa, R.T., Silva, S., and Pesquita, C. 2020.
Evolving knowledge graph similarity for supervised learning in complex biomedical domains. BMC Bioinformatics 21, 6.
Sun, Y., and Han, J. 2012. Mining heterogeneous information networks: Principles and methodologies. Synth. Lect. Data Min. Knowl. Discov. 3, 1–159.
Sun, Y., Norick, B., Han, J., et al. 2012. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks, 1348. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’12. Presented at the 18th ACM SIGKDD international conference, ACM Press, Beijing, China.
Tang, J., Qu, M., Wang, M., et al. 2015. LINE: Large-scale information network embedding, 1067–1077. In Proceedings of the 24th International Conference on World Wide Web. Presented at the WWW’15: 24th International World Wide Web Conference, International World Wide Web Conferences Steering Committee, Florence, Italy.
Teng, Z., Guo, M., Liu, X., et al. 2013. Measuring gene functional similarity based on group-wise comparison of GO terms. Bioinformatics 29, 1424–1432.
van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Wang, J.Z., Du, Z., Payattakool, R., Philip, S.Y., and Chen, C-F. 2007. A new method to measure the semantic similarity of GO terms. Bioinformatics 23, 1274–1281.
Wang, W., Yin, H., Du, X., et al. 2019. Online user representation learning across heterogeneous social networks, 545–554. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. Presented at the SIGIR’19: The 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, Paris, France.
Xu, Y., Guo, M., Shi, W., et al. 2013. A novel insight into Gene Ontology semantic similarity. Genomics 101, 368–375.
Yang, H., Nepusz, T., and Paccanaro, A. 2012.
Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. Bioinformatics 28, 1383–1389.
Zhang, C., Swami, A., and Chawla, N.V. 2019. SHNE: Representation learning for semantic-associated heterogeneous networks, 690–698. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. Presented at the WSDM’19: The Twelfth ACM International Conference on Web Search and Data Mining, ACM, Melbourne, VIC, Australia.
Zhong, J., Zhu, H., Li, J., and Yu, Y. 2002. Conceptual graph matching for semantic search. Conceptual Struct. pp. 92–106.
Zhong, X., Kaalia, R., and Rajapakse, J.C. 2019. GO2Vec: Transforming GO terms and proteins to vector representations via graph embeddings. BMC Genomics 20, 918.

Address correspondence to:
Dr. Esmaeil Nourani
Department of Information Technology
Faculty of Computer Engineering and Information Technology
Azarbaijan Shahid Madani University
35 Km Tabriz-Maragheh Road
Tabriz 53714-161
Iran
E-mail: ac.nourani@azaruniv.ac.ir