Uploaded by Esmaeil Nourani

JOURNAL OF COMPUTATIONAL BIOLOGY
Volume 28, Number 12, 2021
© Mary Ann Liebert, Inc.
Pp. 1196–1207
DOI: 10.1089/cmb.2021.0069
GoVec: Gene Ontology Representation Learning
Using Weighted Heterogeneous Graph and Meta-Path
Downloaded by Copenhagen University Library from www.liebertpub.com at 12/31/21. For personal use only.
ESMAEIL NOURANI1,*,i
ABSTRACT
Biomedical knowledge graphs are crucial to support data-intensive applications in the life
sciences and health care. These graphs can be extended by generating a heterogeneous graph
that contains both ontology terms and biomedical entities. However, state-of-the-art approaches for Gene Ontology representation learning are constrained to homogeneous graphs
that cannot represent different node types and relations. To address this limitation, we present GoVec to produce representations seamlessly for both ontologies and biological entities
by utilizing meta-path-based representation learning in the heterogeneous graph. The resulting vectors can be used in many bioinformatics applications, particularly for calculating
semantic similarity and extracting relations among biological entities. We verify the approach’s usefulness by comparing the resulting semantic similarities with
similarities produced manually by experts. Furthermore, the superiority of GoVec is shown by an extensive set of quantitative and qualitative evaluations. Two downstream tasks,
protein–protein interaction and protein family similarity, are evaluated in comparison with
many state-of-the-art approaches. Finally, as a qualitative visual representation, the separability of various protein families is examined and visually separable groups of proteins are
generated, which shows the capability of GoVec representations to embed functional semantics into the vectors.
Keywords: Gene Ontology, heterogeneous graph, meta-path, representation learning.
1. INTRODUCTION
Biological knowledge captures various aspects of biological phenomena. Nowadays, it is
represented in formal and structured biomedical ontologies, thanks to many years of research
(Bodenreider, 2008). Ontology annotations can be used for categorizing activities and associations of biological entities (Smith et al., 2007). Annotation is performed by associating specific terms of the ontology
along with the evidence meta-data to the biological entities such as genes and gene products (Hill et al., 2008;
Smaili et al., 2018).
1 Department of Information Technology, Faculty of Computer Engineering and Information Technology, Azarbaijan Shahid Madani University, Tabriz, Iran.
* Current address: Novo Nordisk Foundation Center for Protein Research, The Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark.
i ORCID ID (https://orcid.org/0000-0003-1933-2550).
Specifically, Gene Ontology (GO) annotates gene products employing a set of structured vocabularies
(Harris et al., 2004). GO vocabularies consist of three different categories: cellular component (CC), biological process (BP), and molecular function (MF), which are formed as a directed acyclic graph. Nodes of
the graph are called terms that are related to other terms mostly based on a parent–child relationship. There
is a complex hierarchy in the graph since every node may have multiple parents and more than one child.
Biological knowledge graphs and ontology-based annotations are crucial for data science-based applications in the life sciences and health care. Comparison of biological entities in the graph can be performed
based on their ontological descriptions. The extracted information can be used for finding similarities between biomedical entities based on computing semantic similarity. Consequently, this knowledge can be
used in many applications such as protein–protein interaction (PPI) prediction and discovering associations
between diseases and genes.
GO (Ashburner et al., 2000), as the most widely used knowledge graph, has attracted tremendous attention
in bioinformatics applications. Semantic similarity computation between GO terms has received considerable
research attention. Most pioneering studies estimate the similarity based on information content (IC)
(Lin, 1998; Resnik, 1999). However, statistics about the common ancestors in the graph are utilized in
many studies (Lord et al., 2003; Sevilla et al., 2005; Yang et al., 2012; Teng et al., 2013; Song et al., 2014).
Produced semantic similarity can be used as the feature for computational approaches in many applications
such as drug repositioning (Gottlieb et al., 2011) or extracting genomic variants (Robinson et al., 2014;
Boudellioua et al., 2017). What semantic similarity-based methods have in common is that the generated features
cannot include the ontology structure and, consequently, this information is not available to the machine
learning method.
Besides the semantic-similarity-based approaches developed over the past two decades, ontology-based
annotations are also used directly in bioinformatics studies. In this approach, one-hot binary vectors are generated,
representing whether an entity is annotated with a term or not (Sokolov et al., 2013). In this case, the ontology is used
to generate the binary feature vectors, but the structure is not available implicitly or explicitly to the machine learning approach.
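As a minimal illustration of such binary annotation vectors, the following Python sketch builds one vector per entity over a fixed term list. The function name `annotation_vectors` and the toy proteins and term identifiers are hypothetical and serve only to show the encoding; they are not taken from the cited work.

```python
def annotation_vectors(annotations, terms):
    """Return one binary vector per entity: 1 if annotated with the term, else 0."""
    index = {term: i for i, term in enumerate(terms)}
    vectors = {}
    for entity, entity_terms in annotations.items():
        vec = [0] * len(terms)
        for term in entity_terms:
            vec[index[term]] = 1
        vectors[entity] = vec
    return vectors


# Hypothetical toy data: three terms and two annotated proteins.
terms = ["GO:A", "GO:B", "GO:C"]
annotations = {"P1": {"GO:B"}, "P2": {"GO:A", "GO:C"}}
vectors = annotation_vectors(annotations, terms)
```

Note that the vectors record only which terms annotate an entity; the parent–child structure among the terms does not appear anywhere in the encoding, which is the limitation described above.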
Therefore, feature vectors that directly encode both the structure and annotations of entities
can significantly outperform the previous approaches. Recently, word-embedding-based approaches have been
used for this purpose (Smaili et al., 2018, 2019; Duong et al., 2019; Zhong et al., 2019), based on similar
ideas from natural language processing. Latent features produced by these approaches implicitly encode
the structure of the ontology and consequently are used for the calculation of similarity between biological
entities.
Nowadays, the concept of heterogeneous graph representation learning is at the center of attention
(Dareddy et al., 2019; Wang et al., 2019; Zhang et al., 2019; Molaei et al., 2020) because of its ability
to overcome the constraint of traditional network embedding methods by supporting multiple types
of nodes and links among them. Embedding-based approaches for ontology graphs can be significantly
extended to contain a diverse set of entities and ontologies by utilizing the recently introduced concept
of heterogeneous graph representation learning called metapath2vec (Dong et al., 2017).
We present GoVec to jointly generate representation vectors for ontology terms and biological entities
based on a weighted heterogeneous graph and meta-path (Dong et al., 2017). GoVec can be significantly
extended in future studies to support more ontologies and entities. In this study, however, we only demonstrate
the applicability of heterogeneous graphs for generating richer and more
flexible representations. As a case study, we produce vectors for GO terms along with proteins, but the
approach can be easily extended to contain more entities and their relations.
2. METHODS
In this study, we present a novel representation learning method called GoVec for heterogeneous
knowledge graphs, specifically for GO. Table 1 shows the statistics of the employed GO.
Recent state-of-the-art approaches for GO representation learning (Smaili et al., 2018; Zhong et al.,
2019) suffer from serious limitations. The first issue is transforming the directed GO graph to an undirected
graph that might lead to the loss of structural information. The second problem is considering different
relations as a similar edge in the graph. For instance, relations among terms are different than term-protein
annotations, but both are equally considered as a regular edge.
Table 1. Statistics of the Used Gene Ontology

Category               Node/edge   Count
BP                     Node        29,457
MF                     Node        11,093
CC                     Node        4183
Total                  Node        44,733
Is_a                   Edge        75,161
Negatively_regulates   Edge        2822
Part_of                Edge        7544
Positively_regulates   Edge        2799

BP, biological process; CC, cellular component; MF, molecular function.
To address these challenges, we utilize heterogeneous network embedding based on meta-path, which
allows the approach to distinguish between various node types and edge types. We emphasize the usefulness of having various node types in the graph by seamlessly considering different ontology types and
biological entities.
As a simple case study, we only consider proteins as the biological entity along with different types of
GO terms and the various relations among them to form a heterogeneous graph. A similar approach can be
easily considered for different biomedical terms such as chemicals, disease, and phenotypes. In this study,
we utilize three categories of source data to generate the heterogeneous graph.
The first and most important input is the GO graph containing different term types and relations among
them. The edge weight for all edges of the base GO graph in the first step is the maximum value, which is 1.
Figure 1 shows the ancestor graph for the term GO:0022857 (transmembrane transporter activity). Most
of the relation types in the graph are the parent–child relation, which is called ‘‘is-a,’’ but there are other
relation types such as ‘‘part-of,’’ which can be considered as the connection between different categories
of the GO graph. As shown in Figure 1 the base leaf term, which is of type MF, is connected to a BP term
using the ‘‘Part-of’’ relation.
FIG. 1. GO graph ancestor chart for GO:0022857 (adapted from https://www.ebi.ac.uk/QuickGO/term/GO:0022857).
GO, Gene Ontology.
The second input for generating the graph is the list of co-occurrences extracted from QuickGO (Huntley
et al., 2009, 2015). For each term, we have extracted the list of GO terms that are commonly co-annotated
to a gene product. Between every pair of co-occurring terms, we add an edge to the GO graph of the first
step. For every term there is a list of co-occurring terms sorted by a score that shows the
significance of the co-occurrence. For each compared term, this overlap significance is computed
as the ratio of the number of proteins annotated by both terms to the number of proteins annotated by at least one
of them. We use the overlap significance score as the weight of the edge between the co-occurring terms
in the graph. This score is a value between 0 and 100, which is scaled to the range 0 to 1 when used as the edge
weight.
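The co-occurrence weighting can be sketched as follows, assuming the overlap significance is exactly the ratio described above expressed as a percentage. The function names and the toy protein sets are hypothetical; the actual score is taken from QuickGO rather than recomputed.

```python
def overlap_significance(proteins_a, proteins_b):
    """Percentage (0-100) of proteins annotated by both terms among those annotated by either."""
    either = proteins_a | proteins_b
    if not either:
        return 0.0
    return 100.0 * len(proteins_a & proteins_b) / len(either)


def cooccurrence_edge_weight(score):
    """Scale a 0-100 overlap significance score to a 0-1 edge weight."""
    return score / 100.0


# Hypothetical toy annotation sets for two GO terms.
term_a_proteins = {"P1", "P2", "P3"}
term_b_proteins = {"P2", "P3", "P4"}
score = overlap_significance(term_a_proteins, term_b_proteins)
weight = cooccurrence_edge_weight(score)
```

In this toy case two of the four proteins are annotated by both terms, so the score is 50 and the edge weight is 0.5.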
The third input to the graph is the list of annotated proteins for each of the GO terms. There is an
edge between every annotated protein and the GO term. The weights of these edges are normalized by the
number of terms annotating each protein: each edge receives a weight of 1 divided by the number of annotating terms. For example, if 5 terms annotate a protein,
each of the 5 generated edges has a weight of 0.2. It is important to note that the node types for proteins and GO terms are not the same, and this is taken into account when
defining paths for metapath2vec.
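A short sketch of this normalization is given below. The function name `annotation_edges`, the tuple layout, and the toy protein and term identifiers are assumptions made for illustration.

```python
def annotation_edges(protein_terms):
    """Return (protein, term, weight) edges with weight = 1 / number of annotating terms."""
    edges = []
    for protein, terms in protein_terms.items():
        if not terms:
            continue  # a protein without annotations contributes no edges
        weight = 1.0 / len(terms)
        for term in sorted(terms):
            edges.append((protein, term, weight))
    return edges


# Hypothetical protein annotated by five GO terms.
protein_terms = {"P1": {"GO:A", "GO:B", "GO:C", "GO:D", "GO:E"}}
edges = annotation_edges(protein_terms)
```

With five annotating terms, each of the five edges receives the weight 0.2, matching the example in the text.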
Figure 2 illustrates details of the proposed approach to generate representations for GO terms and proteins. Since an integrated heterogeneous graph is used as the input to the embedding method, representations are generated jointly for different node types, including GO terms and proteins.
2.1. Using meta-path-based random walks to generate representations
To produce representations we use the metapath2vec approach (Dong et al., 2017). The advantage of
this approach in comparison with traditional word2vec-based (Mikolov et al., 2013) approaches such
as node2vec (Grover and Leskovec, 2016), LINE (Tang et al., 2015), and DeepWalk (Perozzi et al., 2014)
is the ability to consider multiple types of nodes and links, which is the case we have in this study and can
be easily extended to other domains. The conventional network embedding approach is constrained to a
homogeneous graph that has the same node type and edge type. However, most real-world applications
require more flexibility to capture various entities and relation types.
metapath2vec proposes a heterogeneous version of the skip-gram model. Skip-gram is employed in all
of the word-embedding-based approaches for generating representations using random walks. In the heterogeneous meta-path-based random walks, there is no bias toward highly visible types of nodes. Meta-paths
generate paths that are capable of capturing both semantic and structural correlations among various types
of nodes.
FIG. 2. The workflow of the presented GoVec method for representation learning.
2.1.1. Heterogeneous network representation learning. A heterogeneous network is defined as
a graph $G = (V, E, T)$, where each node $v$ and each edge $e$ has a type given by the mapping functions $\phi(v): V \to T_V$ and
$\varphi(e): E \to T_E$, respectively. $T_V$ and $T_E$ are the sets of node types and edge types, where $|T_V| + |T_E| > 2$. In this study,
GO terms and proteins are two different node types, but the graph can be easily extended to other biomedical entities.
Given a heterogeneous network $G$, metapath2vec produces $d$-dimensional vector representations
$X \in \mathbb{R}^{|V| \times d}$, $d \ll |V|$, which capture structural and semantic relations. The representation of each node $v$
is a $d$-dimensional vector $X_v$ in a unified latent space.
Metapath2vec enables the skip-gram model to produce effective representations for a heterogeneous
network by maximizing the probability of having the heterogeneous context $N_t(v)$, $t \in T_V$, given a node $v$:

$$\arg\max_{\theta} \sum_{v \in V} \sum_{t \in T_V} \sum_{c_t \in N_t(v)} \log p(c_t \mid v; \theta), \tag{1}$$

where $N_t(v)$ is the neighborhood of node $v$ with the $t$th node type and $p(c_t \mid v; \theta)$ is defined as a softmax
function:

$$p(c_t \mid v; \theta) = \frac{e^{X_{c_t} \cdot X_v}}{\sum_{u \in V} e^{X_u \cdot X_v}},$$

where $X_v$ is the $v$th row of $X$, which is the embedding vector of node $v$.
A meta-path scheme is presented as a path of the form

$$\rho = V_1 \xrightarrow{R_1} V_2 \xrightarrow{R_2} \cdots V_t \xrightarrow{R_t} V_{t+1} \cdots \xrightarrow{R_{l-1}} V_l,$$

which ends at the same node type it starts from, the type for which a representation is to be generated. $R_t$ is the relation
between node types $V_t$ and $V_{t+1}$. As a concrete example, if we consider three node types, including GO terms, protein domains, and proteins, we can design a meta-path as follows:

$$\text{Protein} \xrightarrow{\text{Annotated By}} \text{GO Term} \xrightarrow{\text{IS-A}} \text{GO Term} \xrightarrow{\text{Annotates}} \text{Protein} \xrightarrow{\text{Contains}} \text{Domain} \xrightarrow{\text{Appear-in}} \text{Protein}.$$

Previous studies have confirmed the usefulness
of meta-paths in heterogeneous networks (Sun and Han, 2012; Sun et al., 2012; Dong et al., 2015).
As shown in Figure 2, we have defined various meta-paths that cover all node types in the graph, including GO terms and proteins. For example, the last meta-path in Figure 2 defines how two proteins can be
related to each other based on a common GO term. Equation 2 shows how meta-paths guide heterogeneous random walkers by modifying the transition probability $p$ from node $v_t^i$, where $i$ is the
step number and $t$ is the node type:

$$p(v^{i+1} \mid v_t^i, \rho) = \begin{cases} \dfrac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) = t+1 \\ 0 & (v^{i+1}, v_t^i) \in E,\ \phi(v^{i+1}) \neq t+1 \\ 0 & (v^{i+1}, v_t^i) \notin E \end{cases} \tag{2}$$

In Equation 2, $v_t^i \in V_t$ and $N_{t+1}(v_t^i)$ is the set of neighbors of node $v_t^i$ that are of type $V_{t+1}$. That is to say,
random walks follow the predefined meta-paths. The meta-path-based random walk properly feeds the semantic connections among various node types to the skip-gram model, where the final representation is
generated. Table 2 shows the parameters used for the word2vec model for generating representations.
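The meta-path-guided transition rule can be sketched in a few lines of Python. This is a simplified illustration, not the metapath2vec implementation: the graph is a plain adjacency dictionary, the next node is drawn uniformly among neighbors of the type required by the meta-path (so edge weights are ignored), and the names `metapath_walk`, `graph`, and `node_type` as well as the toy nodes are hypothetical.

```python
import random


def metapath_walk(graph, node_type, start, metapath, walk_length, rng):
    """Random walk that only moves to neighbors whose type is next in the meta-path.

    graph: dict mapping a node to a list of neighbor nodes.
    node_type: dict mapping a node to its type label.
    metapath: list of type labels whose first and last entries are equal.
    """
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    cycle = metapath[:-1]  # the last type coincides with the first, so the path repeats
    walk = [start]
    step = 0
    while len(walk) < walk_length:
        wanted = cycle[(step + 1) % len(cycle)]
        candidates = [n for n in graph[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # every transition allowed by the meta-path has probability zero
        walk.append(rng.choice(candidates))  # uniform over the allowed neighbors
        step += 1
    return walk


# Hypothetical toy graph with three proteins and two GO terms.
graph = {
    "P1": ["GO:A"],
    "P2": ["GO:A", "GO:B"],
    "P3": ["GO:B"],
    "GO:A": ["P1", "P2", "GO:B"],
    "GO:B": ["P2", "P3", "GO:A"],
}
node_type = {"P1": "protein", "P2": "protein", "P3": "protein",
             "GO:A": "term", "GO:B": "term"}
walk = metapath_walk(graph, node_type, "P1",
                     ["protein", "term", "protein"], 5, random.Random(0))
```

Walks produced this way are treated as sentences and passed to the skip-gram model; with the protein, term, protein meta-path the walker never follows the term-to-term edge even though it exists in the graph.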
To compute the semantic similarity between the vectors of two proteins we use cosine similarity.
Produced scores using GoVec representations will be compared with the quoted precomputed semantic
similarity measures from the benchmark data sets for every protein pair.
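Cosine similarity between two embedding vectors can be computed as in the following minimal sketch. The function name and the short example vectors are hypothetical; real GoVec vectors have 128 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical low-dimensional protein vectors.
protein_a = [1.0, 2.0, 3.0]
protein_b = [2.0, 4.0, 6.0]
similarity = cosine_similarity(protein_a, protein_b)
```

Because the measure depends only on direction, the two example vectors, which differ only by a scale factor, have a similarity of 1 up to floating-point rounding.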
Table 2. Used Parameters for Training the Model

Parameter name   Definition                                                                       GoVec value
Sg               Selecting the training algorithm between skip-gram (sg = 1) and CBOW (sg = 0)   1
Length           The maximum length of a random walk                                              50
N                Number of random walks per root node                                             20
D                Embedding size                                                                   128
Window           Context window size                                                              5

CBOW, continuous bag of words model.

3. RESULTS AND DISCUSSION
To evaluate the presented approach, we have conducted an extensive set of experiments, from comparing
the results with ontologies manually analyzed by experts to evaluating the method over benchmark data
sets and showing the superiority of GoVec in comparison with other approaches. Previous studies mostly
evaluated their methods in terms of semantic similarity using the collaborative evaluation of semantic similarity measures (CESSM) web tool (Pesquita et al., 2009). Unfortunately, this benchmark server is not
currently available, so in this study we utilize a comprehensive and recently introduced benchmark
(Cardoso et al., 2020), which is designed for knowledge graph-based similarity in the biomedical domain.
This benchmark covers different species; in this study, human data are considered in all experiments.
The benchmark introduces reference proxy measures for similarity calculated based on protein MF and
PPIs. Furthermore, it provides semantic similarity computations with state-of-the-art representative
measures for a comparative evaluation.
The benchmark data set provides two sets of protein pairs for every candidate species and utilizes both
breadth and depth class annotations. In the first set (one aspect), the proteins of every pair should be annotated
in at least one of the GO aspects, with at least one leaf term among the annotating terms. In the
second set (all aspects), the proteins of every pair should have at least one annotation in each GO aspect, and
within the annotating terms, there should be at least one leaf term of every aspect. Table 3 summarizes the
statistics of the employed data sets from the benchmark, which are used for evaluating the performance of
GoVec in the following sections.
3.1. Semantic similarity calculation methods
GoVec is compared with four representative semantic similarity measures (SSMs) (Cardoso et al., 2020).
Each of these measures is designed using a combination of two methods: the first method used to compute
the IC of an annotating class (GO term; ICSeco or ICResnik) and the IC-based method to compute the similarity between the biomedical entities (simGIC or Best Match Average [BMA]).
BMA and simGIC as IC-based entity similarity methods are high-performance classical measures for
semantic similarity, which are widely accepted in the research community, unlike the new structure-based
measures. ICSeco, introduced by Seco et al. (2004), is computed according to the number of children of
class $c$, denoted by $h(c)$:

$$IC_{Seco}(c) = 1 - \frac{\log(h(c) + 1)}{\log(N)}, \tag{3}$$

where $N$ is the total number of classes in the ontology.
ICResnik is presented by Resnik (1999) according to the entities annotated using class $c$ in the GO graph:

$$IC_{Resnik}(c) = -\log p(c), \tag{4}$$

where $p(c)$ is the probability of annotation in the corpus.
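Both information content definitions translate directly into code. In the sketch below, the function and parameter names are hypothetical, the natural logarithm is assumed, and $p(c)$ is estimated as the fraction of annotated entities that carry class $c$, which is one common reading of the annotation probability.

```python
import math


def ic_seco(num_children, total_classes):
    """Structure-based IC: 1 - log(h(c) + 1) / log(N)."""
    return 1.0 - math.log(num_children + 1) / math.log(total_classes)


def ic_resnik(annotated_with_class, total_annotated):
    """Annotation-based IC: -log p(c), with p(c) estimated as a frequency."""
    return -math.log(annotated_with_class / total_annotated)


# Hypothetical values: a leaf class in an ontology of 100 classes,
# and a class annotating 1 of 8 entities.
leaf_ic = ic_seco(0, 100)
rare_ic = ic_resnik(1, 8)
```

A leaf class (no children) obtains the maximum structure-based value of 1, and a class that annotates every entity obtains an annotation-based value of 0, so in both definitions more specific classes are more informative.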
BMA is a pairwise measure in which the similarity of two classes is computed based on the common
ancestor of a pair (Resnik, 1999). For every class, BMA considers the most similar class:

$$BMA(A, B) = \frac{\sum_{c_1 \in C_A} sim(c_1, c_2)}{2|C_A|} + \frac{\sum_{c_2 \in C_B} sim(c_2, c_1)}{2|C_B|}, \tag{5}$$

where $A$ and $B$ are entities (i.e., proteins), $C_A$ and $C_B$ are the sets of classes $c$ that annotate each of the entities, and
$sim(c_1, c_2)$ returns the highest similarity value for the classes. The similarity between two classes is
computed by Resnik similarity:

$$sim(c_1, c_2) = \max\{IC(a) : a \in A(c_1) \cap A(c_2)\}, \tag{6}$$

where $a$ is a class in the ancestors’ set of $c_i$ [i.e., $A(c_i)$]. SimGIC (Pesquita et al., 2008) is similar to Jaccard
similarity, with every class weighted by its IC:

$$SimGIC(A, B) = \frac{\sum_{c \in C_A \cap C_B} IC(c)}{\sum_{c \in C_A \cup C_B} IC(c)}. \tag{7}$$

Table 3. Number of Protein Pairs in the Two Data Sets of the Benchmark for Each of the Reference Proxy Measures

Proxy measure    One aspect/all aspects   Number of protein pairs
Protein family   One aspect               31,350
Protein family   All aspects              25,527
PPI              One aspect               30,826
PPI              All aspects              29,672

PPI, protein–protein interaction.
To sum up, four state-of-the-art methods to compute the semantic similarity are utilized in the benchmark: BMAResnik, BMASeco, simGICResnik, and simGICSeco, obtained by combining the IC computation methods (ICSeco or
ICResnik) with the entity similarity computation methods (simGIC or BMA). In the following sections, the results
of comparing GoVec with these four methods are discussed. Semantic similarity results of these methods
are directly extracted from the benchmark (Cardoso et al., 2020) for every protein pair of the data sets, whereas
the semantic similarity of GoVec is produced using cosine similarity between two protein vectors.
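For orientation, the two entity-level measures can be sketched as below on a toy ontology. This is an illustrative reading of the BMA and simGIC definitions, not the benchmark's code: the ancestor set of a class is assumed to include the class itself, simGIC is computed over the annotating class sets as written above, and all names and IC values are hypothetical.

```python
def resnik_sim(c1, c2, ancestors, ic):
    """IC of the most informative common ancestor of two classes."""
    common = ancestors[c1] & ancestors[c2]
    return max((ic[a] for a in common), default=0.0)


def bma(classes_a, classes_b, ancestors, ic):
    """Best Match Average: average of the best match found for every class, in both directions."""
    if not classes_a or not classes_b:
        return 0.0
    best_a = sum(max(resnik_sim(c1, c2, ancestors, ic) for c2 in classes_b)
                 for c1 in classes_a)
    best_b = sum(max(resnik_sim(c2, c1, ancestors, ic) for c1 in classes_a)
                 for c2 in classes_b)
    return best_a / (2 * len(classes_a)) + best_b / (2 * len(classes_b))


def sim_gic(classes_a, classes_b, ic):
    """IC-weighted Jaccard index over the annotating classes."""
    denominator = sum(ic[c] for c in classes_a | classes_b)
    if denominator == 0.0:
        return 0.0
    return sum(ic[c] for c in classes_a & classes_b) / denominator


# Hypothetical toy ontology: root -> {a, b}, a -> {a1, a2}.
ancestors = {
    "root": {"root"},
    "a": {"a", "root"},
    "b": {"b", "root"},
    "a1": {"a1", "a", "root"},
    "a2": {"a2", "a", "root"},
}
ic = {"root": 0.0, "a": 1.0, "b": 1.0, "a1": 2.0, "a2": 2.0}

bma_score = bma({"a1", "b"}, {"a1"}, ancestors, ic)
gic_score = sim_gic({"a1", "b"}, {"a1"}, ic)
```

In the toy case the shared class `a1` dominates both scores, while the unrelated class `b`, whose only ancestor in common with `a1` is the uninformative root, lowers them.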
3.2. Protein family similarity
The first proxy measure for evaluating other approaches is MF similarity, which is calculated based on
comparing the common domains of the two candidate protein sequences. Functional domains of proteins
are gathered from the Pfam database (El-Gebali et al., 2019). Pfam similarity (SimPfam) is calculated using
Jaccard similarity, based on the number of common families:
$$Sim_{Pfam}(A, B) = \frac{|f_A \cap f_B|}{|f_A \cup f_B|}, \tag{8}$$

where $f_A$ and $f_B$ are the sets of families for the two proteins. If they share more functional domains, the SSM, especially in
the MF aspect, will be higher. Based on the results in Table 4, GoVec outperforms the other approaches.
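The Pfam proxy is a plain Jaccard index over family sets, as in the following sketch. The function name and the family labels are hypothetical placeholders rather than real Pfam accessions.

```python
def sim_pfam(families_a, families_b):
    """Jaccard similarity between the Pfam family sets of two proteins."""
    union = families_a | families_b
    if not union:
        return 0.0
    return len(families_a & families_b) / len(union)


# Hypothetical family sets for two proteins sharing one of three families.
protein_a_families = {"FAM1", "FAM2"}
protein_b_families = {"FAM2", "FAM3"}
pfam_similarity = sim_pfam(protein_a_families, protein_b_families)
```

Here one family is shared out of three distinct families, giving a similarity of one third.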
3.3. PPI similarity
The second proxy measure is designed based on PPI. In this measure, two proteins are considered to be similar if
they are interacting. Correlation between the PPI and GO-based semantic similarity is reasonable when two
proteins share the same cellular location and same BP. In this case, they are more likely to interact and share
common CC and BP terms. This is verified in some studies to gain more accuracy in PPI prediction by using terms
of these two aspects (Sousa et al., 2020). However, noninteracting proteins can be considered as similar based on
different metrics. That is why we consider both PPI and MF similarity as the complementary proxy measures.
Table 5 shows the results of different semantic similarity calculation methods.
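The evaluation against a proxy measure reduces to a Pearson correlation between two lists, one holding the proxy value of every protein pair and one holding the computed semantic similarity. A dependency-free sketch is shown below; the function name and the four toy pairs are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (std_x * std_y)


# Hypothetical protein pairs: binary PPI label and a semantic similarity score.
interacts = [1, 1, 0, 0]
similarity_scores = [0.9, 0.7, 0.2, 0.4]
pcc = pearson(interacts, similarity_scores)
```

With a binary proxy such as PPI, a high coefficient means that interacting pairs tend to receive higher similarity scores than noninteracting pairs.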
Table 4. Pearson Correlation Coefficient Between the Pfam Similarity and Semantic Similarity Measures Computed by the Methods

Method         PCC (one aspect)   PCC (all aspects)
BMAResnik      0.692              0.702
BMASeco        0.733              0.743
simGICResnik   0.647              0.659
simGICSeco     0.663              0.674
GoVec          0.753              0.761

Boldface represents best result.
BMA, Best Match Average; PCC, Pearson correlation coefficient.

Table 5. Pearson Correlation Coefficient Between Protein–Protein Interaction (Binary Value) and Semantic Similarity Computed by the Methods for Every Pair of Proteins

Method         PCC (one aspect)   PCC (all aspects)
BMAResnik      0.489              0.493
BMASeco        0.505              0.509
simGICResnik   0.379              0.380
simGICSeco     0.389              0.390
GoVec          0.521              0.526

Boldface represents best result.

3.4. Comparison with word2vec-based approaches
The advantage of GoVec in comparison with traditional word2vec-based approaches such as node2vec
and Onto2vec (Smaili et al., 2018) is the ability to consider multiple types of nodes and links among them.
Onto2vec is a word2vec-based approach that generates representations for GO terms and, accordingly, for
proteins based on the GO knowledge graph. To compare the performance of GoVec with other word2vec-based
methods, we add protein domains as another entity type to the knowledge graph. For this purpose, we utilize
the InterPro database (Hunter et al., 2009) and add protein-domain associations to the graph. InterPro release
85.0 is used in this study, and the statistics of the release are summarized in Table 6. Besides adding protein-domain associations, we add another relation to the graph between GO terms and InterPro domains, which is
based on mappings extracted from the InterPro database. Therefore, the resulting heterogeneous graph contains
three types of entities (proteins, GO terms, and protein domains) and the associations between them. As a
result, we can produce more insightful meta-paths in the graph and generate representations for proteins based
on both associated GO terms and InterPro domains. As a side benefit, we can now generate
representations for InterPro domains in addition to representations for GO terms and proteins.
Considering that both Onto2vec and GoVec are word2vec-based approaches, to have a fair comparison in
terms of hyperparameters, GO graph, and benchmark data set, we train and evaluate a word2vec model based on
Onto2vec’s approach but using the same hyperparameters, GO graph, and benchmark data set as GoVec.
As shown in Table 7, we provide several versions of GoVec based on the entity types they contain. The
first version contains only GO terms and proteins and the corresponding association types, whereas the next
version contains three types of entities: GO terms, proteins, and domains. We also checked the performance of the method without adding co-occurrences of GO terms. The results show that adding
domain nodes to the graph has a larger effect than adding co-occurrence terms, although adding
co-occurrences still improves performance. To show the benefit of defining meta-paths,
we consider an identical graph containing all three node types and associations, with exactly the same
hyperparameters for the word2vec model but without meta-paths, and generate representations using only
the approach of Onto2vec. The results in Table 7 are promising: we achieve the best results for the
version of GoVec that utilizes meta-paths and domain information in comparison with the original word2vec.
Furthermore, we provide GoVec results before adding domain nodes, which shows the advantage of
utilizing more entity types and relation types in the graph. As expected, integrating domain nodes into the
graph improves performance considerably more for the Pfam similarity task than for the PPI task.
The performance of Onto2vec on the Pfam similarity task is the lowest among the approaches, which may
be because domain information is not utilized in this method. The Pfam similarity task is clearly
associated with domain information, so the improvement obtained by adding domain types to the graph is expected, which signifies the advantage of diverse entity types in the graph.
The aim of this experiment is to show the flexibility of the approach to integrate more biomedical entity
types and achieve a more comprehensive knowledge graph. As future steps, other valuable node
and edge types can be added, such as pathways, with the membership of proteins in pathways
as edges in the graph. Furthermore, the hierarchy of pathways is another insightful relation
and can be added as a pathway–pathway relation to the graph.
Table 6. Statistics of the Employed InterPro Database (Release 85.0, April 2021)

Entry type               Count
Homologous superfamily   3285
Family                   22,891
Domain                   11,218
Repeat                   322
Site                     917
Total                    38,633

Table 7. Comparison Between GoVec Versions and Other word2vec-Based Approaches

                                                   PCC (Pfam similarity)      PCC (PPI)
Method                                             One aspect   All aspects   One aspect   All aspects
Onto2Vec (GO + Proteins)                           0.698        0.701         0.529        0.538
Node2Vec (GO + Proteins + Domains)                 0.778        0.781         0.489        0.493
GoVec (GO + Proteins)                              0.749        0.756         0.514        0.518
GoVec (GO + Proteins + Co-Occurrences)             0.753        0.761         0.521        0.526
GoVec (GO + Proteins + Domains)                    0.795        0.798         0.520        0.525
GoVec (GO + Proteins + Domains + Co-Occurrences)   0.808        0.811         0.523        0.529

Boldface represents best result.
GO, Gene Ontology.

3.5. Comparison of GoVec with human ratings
In this section, we compare the semantic similarities computed using GoVec representations with
the semantic similarities assigned by experts. Rong et al. (2006) selected 25 pairs of GO terms that
include various levels of similarity and asked 10 biologists to grade these pairs based on their similarity
ranging from 0 to 10. The average of the 10 grades for every pair is considered as the reference semantic
similarity, which is compared with the semantic similarity values of the methods. In this study, we directly
quote the values of the compared methods from Xu et al. (2013), compute similarity grades for these pairs
using GoVec representations, and finally calculate the Pearson correlation coefficients (PCCs) between
these grades and the human ratings. A higher PCC represents better performance and closer agreement with the
manual ratings by the biologists. Table 8 shows the results, which verify the superiority of
GoVec in comparison with seven referenced methods that are extracted from Xu et al. (2013).
3.6. Clustering and visualization
GoVec representations can be used for unsupervised clustering and visualizations to recognize biologically related groups. As the case study, the separability of proteins to the functional groups is evaluated. In
this study, enzyme proteins employed in Onto2vec for the same visualization purpose are used.
First, t-distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten and Hinton, 2008) is used
for dimensionality reduction to convert high-dimensional (128 D) GoVec vectors to the two dimensional
(2D) points to be used for visualization. t-SNE is similar to principal component analysis (PCA) but can
extract nonlinear patterns from the data points using probability distributions, whereas linear methods such
as PCA are not able to capture nonlinear structures. To achieve a better semantic from the separated groups
in the visual representation, similar to Onto2Vec, enzyme commission (EC) numbers for proteins are used
as distinguishing six different labels based on the top-level EC category of every protein. As shown in
Figure 3, visually separable groups have the same color (single EC top-level category), which shows the
capability of GoVec representations to embed functional semantics into the vectors.
As a quantitative evaluation of separability based on EC top-level category, K-means clustering is
applied over the GoVec representation (k = 6) and cluster purity is calculated:
Table 8. Pearson Correlation Coefficient Between Human Ratings (for Every Pair)
and the Semantic Similarities Computed by the Methods
Method
PCC
Combine (Rong et al., 2006)
Lin (Lin, 1998)
simUI (Gentleman, 2006)
Wang (Wang et al., 2007)
Resnik (Resnik, 1999)
(Zhong et al., 2002)
SSDD (Xu et al., 2013)
GoVec
0.864
0.865
0.834
0.826
0.824
0.714
0.894
0.921
SSDD, shortest semantic differentiation distance.
Boldface represents best result.
Downloaded by Copenhagen University Library from www.liebertpub.com at 12/31/21. For personal use only.
GOVEC: GENE ONTOLOGY REPRESENTATION LEARNING
1205
FIG. 3. t-SNE visualization of proteins (Employed in Onto2vec), using GoVec vectors based on first level EC
numbers. EC, enzyme commission; t-SNE, t-distributed Stochastic Neighbor Embedding.
purityðT‚ CÞ =
k
1X
max ck [ tj ‚
N i=0 j
(9)
where N is the number of proteins, T is the set of computed clusters, and C is the set of classes derived from the EC
numbers; there are six classes, one for each top-level EC category. The resulting
purity is 0.56, compared with 0.42 for Onto2Vec, an improvement of 0.14.
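The purity measure of Equation (9) can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the code used in the study: the function name `purity` and the toy data are hypothetical, and the cluster assignments are assumed to come from a prior K-means run over the GoVec vectors.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of items that belong to the majority class of their cluster."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one item is required")
    # Count how often each class label occurs inside each computed cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1
    # For every cluster, keep the size of its largest overlap with any class.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Toy data (hypothetical): 8 proteins, 3 computed clusters, EC classes 1-3.
toy_clusters = [0, 0, 0, 1, 1, 1, 2, 2]
toy_ec_classes = [1, 1, 2, 2, 2, 3, 3, 3]
toy_purity = purity(toy_clusters, toy_ec_classes)  # (2 + 2 + 2) / 8 = 0.75
```

A purity of 1.0 means every cluster contains proteins from a single EC class; values near 1/6 would be expected when six equally sized classes are mixed uniformly across clusters.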
4. CONCLUSIONS
In this study, we introduced GoVec, a novel meta-path-based representation learning method for the heterogeneous GO
graph. It addresses a limitation of state-of-the-art approaches to GO representation learning, which are
constrained to homogeneous graphs that cannot represent different node types and relations. GoVec produces
representations seamlessly for both ontologies and biological entities. To evaluate the effectiveness of GoVec
vectors, we computed semantic similarity between vectors of biological entities and compared the results with proxy
measures, including protein MF similarity and PPI. We employed public benchmark data sets that are specifically
designed for evaluating semantic similarity computation methods. In the conducted experiments, GoVec consistently
outperformed other state-of-the-art methods. Besides the comparison with computational approaches,
we compared the produced similarity scores with the similarity values assigned by human experts and obtained
a correlation above 0.9 (PCC = 0.921, Table 8). Finally, as a qualitative visual representation, the separability of various protein
families was examined based on top-level EC numbers. The proposed approach can be easily extended to other
biomedical knowledge graphs that contain different types of entities and various relation types among them.
AUTHOR DISCLOSURE STATEMENT
The author declares there are no competing financial interests.
FUNDING INFORMATION
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
REFERENCES
Ashburner, M., Ball, C.A., Blake, J.A., et al. 2000. Gene Ontology: Tool for the unification of biology. Nat. Genet. 25, 25–29.
Bodenreider, O. 2008. Biomedical ontologies in action: Role in knowledge management, data integration and decision
support. Yearb. Med. Inform. 17, 67–79.
Boudellioua, I., Mahamad Razali, R.B., Kulmanov, M., et al. 2017. Semantic prioritization of novel causative genomic
variants. PLoS Comput. Biol. 13, e1005500.
Cardoso, C., Sousa, R.T., Köhler, S., et al. 2020. A collection of benchmark data sets for knowledge graph-based
similarity in the biomedical domain. Database 2020, baaa078.
Dareddy, M.R., Das, M., and Yang, H. 2019. motif2vec: Motif aware node representation learning for heterogeneous
networks, 1052–1059. In 2019 IEEE International Conference on Big Data (Big Data). Presented at the 2019 IEEE
International Conference on Big Data (Big Data), IEEE, Los Angeles, CA, USA.
Dong, Y., Chawla, N.V., and Swami, A. 2017. metapath2vec: Scalable representation learning for heterogeneous
networks, 135–144. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. Presented at the KDD’17: The 23rd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, ACM, Halifax NS Canada.
Dong, Y., Zhang, J., Tang, J., et al. 2015. CoupledLP: Link prediction in coupled networks, 199–208. In Proceedings of
the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Presented at the
KDD’15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM,
Sydney, NSW, Australia.
Duong, D., Ahmad, W.U., Eskin, E., et al. 2019. Word and sentence embedding tools to measure semantic similarity of
Gene Ontology terms by their definitions. J. Comput. Biol. 26, 38–52.
El-Gebali, S., Mistry, J., Bateman, A., et al. 2019. The Pfam protein families database in 2019. Nucleic Acids Res. 47,
D427–D432.
Gentleman, R. 2005–6. Visualizing and distances using GO. http://www.bioconductor.org/docs/vignettes.html (2005).
Gottlieb, A., Stein, G.Y., Ruppin, E., et al. 2011. PREDICT: A method for inferring novel drug indications with
application to personalized medicine. Mol. Syst. Biol. 7, 496.
Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks, 855–864. In Proceedings of the
22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16. Association for
Computing Machinery, New York, NY, USA.
Harris, M.A., Clark, J., Ireland, A., et al. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic
Acids Res. 32, D258–D261.
Hill, D.P., Smith, B., McAndrews-Hill, M.S., et al. 2008. Gene Ontology annotations: What they mean and where they
come from. BMC Bioinformatics 9, S2.
Hunter, S., Apweiler, R., Attwood, T.K., et al. 2009. InterPro: The integrative protein signature database. Nucleic Acids
Res. 37, D211–D215.
Huntley, R.P., Binns, D., Dimmer, E., et al. 2009. QuickGO: A user tutorial for the web-based Gene Ontology browser.
Database 2009, bap010.
Huntley, R.P., Sawford, T., Mutowo-Meullenet, P., et al. 2015. The GOA database: Gene Ontology annotation updates
for 2015. Nucleic Acids Res. 43, D1057–D1063.
Lin, D. 1998. An information-theoretic definition of similarity, 296–304. In: Fifteenth International Conference on
Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA.
Lord, P.W., Stevens, R.D., Brass, A., et al. 2003. Investigating semantic similarity measures across the Gene Ontology:
The relationship between sequence and annotation. Bioinformatics 19, 1275–1283.
Mikolov, T., Sutskever, I., Chen, K., et al. 2013. Distributed representations of words and phrases and their compositionality, 3111–3119. In Burges, C.J.C., Bottou, L., Welling, M., et al., eds. Advances in Neural Information
Processing Systems 26. Curran Associates, Inc., Red Hook, NY.
Molaei, S., Zare, H., and Veisi, H. 2020. Deep learning approach on information diffusion in heterogeneous networks.
Knowl. Based Syst. 189, 105153.
Perozzi, B., Al-Rfou, R., and Skiena, S. 2014. DeepWalk: Online learning of social representations, 701–710. In
Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD’14. Association for Computing Machinery, New York, NY, USA.
Pesquita, C., Faria, D., Bastos, H., et al. 2008. Metrics for GO based protein semantic similarity: A systematic
evaluation. BMC Bioinformatics 9, S4.
Pesquita, C., Pessoa, D., Faria, D., et al. 2009. CESSM: Collaborative evaluation of semantic similarity measures.
JB2009 Challeng. Bioinform. 157, 190.
Resnik, P. 1999. Using information content to evaluate semantic similarity in a taxonomy, 448–453. In Proceedings of the 14th International Joint Conference on Artificial Intelligence. Morgan Kaufmann Publishers Inc., Montreal.
Robinson, P.N., Köhler, S., Oellrich, A., et al. 2014. Improved exome prioritization of disease genes through cross-species phenotype comparison. Genome Res. 24, 340–348.
Rong, L., Shunliang, C., Yuanyuan, L., et al. 2006. A measure of semantic similarity between gene ontology terms
based on semantic pathway covering. Progr. Nat. Sci. 16, 721–726.
Seco, N., Veale, T., and Hayes, J. 2004. An intrinsic information content metric for semantic similarity in WordNet,
1089–1090. In Proceedings of the 16th European Conference on Artificial Intelligence, ECAI’04. IOS Press, NLD.
Sevilla, J.L., Segura, V., Podhorski, A., et al. 2005. Correlation between gene expression and GO semantic similarity.
IEEE ACM Trans. Comput. Biol. Bioinf. 2, 330–338.
Smaili, F.Z., Gao, X., and Hoehndorf, R. 2018. Onto2Vec: Joint vector-based representation of biological entities and
their ontology-based annotations. Bioinformatics 34, i52–i60.
Smaili, F.Z., Gao, X., and Hoehndorf, R. 2019. OPA2Vec: Combining formal and informal content of biomedical
ontologies to improve similarity-based prediction. Bioinformatics 35, 2133–2140.
Smith, B., Ashburner, M., Rosse, C., et al. 2007. The OBO foundry: Coordinated evolution of ontologies to support
biomedical data integration. Nat. Biotechnol. 25, 1251.
Sokolov, A., Funk, C., Graim, K., et al. 2013. Combining heterogeneous data sources for accurate functional annotation
of proteins. BMC Bioinformatics 14, S10.
Song, X., Li, L., Srimani, P.K., et al. 2014. Measure the semantic similarity of GO terms using aggregate information
content. IEEE ACM Trans. Comput. Biol. Bioinf. 11, 468–476.
Sousa, R.T., Silva, S., and Pesquita, C. 2020. Evolving knowledge graph similarity for supervised learning in complex
biomedical domains. BMC Bioinformatics 21, 6.
Sun, Y., and Han, J. 2012. Mining heterogeneous information networks: Principles and methodologies. Synth. Lect.
Data Min. Knowl. Discov. 3, 1–159.
Sun, Y., Norick, B., Han, J., et al. 2012. Integrating meta-path selection with user-guided object clustering in heterogeneous information networks, 1348. In Proceedings of the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining—KDD’12. Presented at the 18th ACM SIGKDD international conference,
ACM Press, Beijing, China.
Tang, J., Qu, M., Wang, M., et al. 2015. LINE: Large-scale information network embedding, 1067–1077. In Proceedings of the 24th International Conference on World Wide Web. Presented at the WWW’15: 24th International
World Wide Web Conference, International World Wide Web Conferences Steering Committee, Florence Italy.
Teng, Z., Guo, M., Liu, X., et al. 2013. Measuring gene functional similarity based on group-wise comparison of GO
terms. Bioinformatics 29, 1424–1432.
van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605.
Wang, J.Z., Du, Z., Payattakool, R., Philip, S.Y., and Chen, C-F. 2007. A new method to measure the semantic
similarity of GO terms. Bioinformatics 23, 1274–1281.
Wang, W., Yin, H., Du, X., et al. 2019. Online user representation learning across heterogeneous social networks, 545–
554. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information
Retrieval. Presented at the SIGIR’19: The 42nd International ACM SIGIR Conference on Research and Development
in Information Retrieval, ACM, Paris, France.
Xu, Y., Guo, M., Shi, W., et al. 2013. A novel insight into Gene Ontology semantic similarity. Genomics 101, 368–375.
Yang, H., Nepusz, T., and Paccanaro, A. 2012. Improving GO semantic similarity measures by exploring the ontology
beneath the terms and modelling uncertainty. Bioinformatics 28, 1383–1389.
Zhang, C., Swami, A., and Chawla, N.V. 2019. SHNE: Representation learning for semantic-associated heterogeneous
networks, 690–698. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining.
Presented at the WSDM’19: The Twelfth ACM International Conference on Web Search and Data Mining, ACM,
Melbourne, VIC, Australia.
Zhong, J., Zhu, H., Li, J., and Yu, Y. 2002. Conceptual graph matching for semantic search. Conceptual Struct. pp. 92–106.
Zhong, X., Kaalia, R., and Rajapakse, J.C. 2019. GO2Vec: Transforming GO terms and proteins to vector representations via graph embeddings. BMC Genomics 20, 918.
Address correspondence to:
Dr. Esmaeil Nourani
Department of Information Technology
Faculty of Computer Engineering and Information Technology
Azarbaijan Shahid Madani University
35 Km Tabriz-Maragheh Road
Tabriz 53714-161
Iran
E-mail: ac.nourani@azaruniv.ac.ir