Uploaded by ilyesrezgui46

Code2vec: Learning Code Representations with AST Paths

1-Introduction
Have you ever had the need to encode a code snippet written in a specific programming language as an
embedding vector capturing information of that code? If yes, this article is for you.
In this article we will walk through the research paper “code2vec: Learning Distributed Representations
of Code” and discuss how its authors trained a neural network on syntactic-only data (AST
paths) to predict the name of a method given its body.
You can test the model via this UI: https://code2vec.org/
Figure 1: Prediction example of the name of the method f() as done() (Source: https://code2vec.org/)
2-Abstract syntax trees (AST)
An Abstract Syntax Tree (AST) is a data structure used in computer science to represent the syntactic
structure of a program. It is typically generated during the process of parsing a program written in a
programming language.
For example, if we take this method f() written in Java and parse it using the Java parser, we will get as
output the abstract syntax tree shown in Figure 3.
Figure 2 : Java code snippet (source : https://arxiv.org/abs/1803.09473)
Figure 3 : AST of the code shown in Figure 2 (source: https://arxiv.org/abs/1803.09473)
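To make the idea of parsing concrete, here is a minimal sketch that uses Python's built-in ast module rather than the Java parser used for the figures above. The function f below is a hypothetical Python stand-in for the Java method, so the node names it produces (FunctionDef, Return, Compare) differ from the Java ones in Figure 3.

```python
import ast

# Hypothetical Python analogue of the method f(); not the snippet from Figure 2.
source = "def f(target):\n    return target is None\n"

tree = ast.parse(source)   # root node, of type ast.Module
func = tree.body[0]        # the FunctionDef node for f

# Collect the type name of every node reachable from the function definition.
node_types = [type(node).__name__ for node in ast.walk(func)]

print(ast.dump(func))      # textual rendering of the subtree
print(node_types)
```

The printed list shows that the tree keeps only syntactic structure (declarations, statements, expressions, names) and none of the original formatting.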
Once the tree is generated, we can extract “AST paths” from it. An AST path is a sequence of nodes
that leads from one leaf (terminal) of the Abstract Syntax Tree to another: it goes up (↑) to the
common ancestor of the two leaves and then down (↓). From Figure 3, we can extract these paths, for example:
The path between (boolean, ?):
Primitive ↑ MethodDecl ↓ Name
Figure 4: Example 1 (source: author)
The path between (boolean, Object):
Primitive ↑ MethodDecl ↓ Parameter ↓ Class
Figure 5: Example 2 (source: author)
The path between (boolean, target):
Primitive ↑ MethodDecl ↓ Parameter ↓ VarDeclId
Figure 6: Example 3 (source: author)
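The three examples above can be reproduced with a short sketch. It assumes a hand-built, simplified tree that contains only the signature nodes named in the examples (not the full AST of Figure 3); the Node class and the helper names are hypothetical and are not taken from the code2vec implementation.

```python
from itertools import combinations

class Node:
    """Simplified AST node: a kind, an optional token (leaves only), and children."""
    def __init__(self, kind, token=None, children=()):
        self.kind = kind
        self.token = token
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def leaves(node):
    """Return the leaves below node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1].parent is not None:
        chain.append(chain[-1].parent)
    return chain

def path_context(src, dst):
    """Build (start token, path, end token) through the lowest common ancestor."""
    up, down = ancestors(src), ancestors(dst)
    down_ids = {id(n) for n in down}
    i = next(k for k, n in enumerate(up) if id(n) in down_ids)   # common ancestor in up
    j = next(k for k, n in enumerate(down) if n is up[i])        # same node in down
    path = " ↑ ".join(n.kind for n in up[: i + 1])
    below = [n.kind for n in reversed(down[:j])]
    if below:
        path += " ↓ " + " ↓ ".join(below)
    return (src.token, path, dst.token)

# Simplified tree for the signature only; "?" is the method name to be predicted.
root = Node("MethodDecl", children=[
    Node("Primitive", "boolean"),
    Node("Name", "?"),
    Node("Parameter", children=[Node("Class", "Object"), Node("VarDeclId", "target")]),
])

bag = [path_context(a, b) for a, b in combinations(leaves(root), 2)]
for context in bag:
    print(context)
```

The first three entries of bag correspond to the paths (boolean, ?), (boolean, Object) and (boolean, target) listed above; the remaining entries are the other leaf pairs of this small tree.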
3-Code Representation
A code snippet “P” is represented by a bag of path-contexts denoted by “B”, where B = {bᵢ | i ∈ [1, n]}
and n is the number of paths extracted from the AST after parsing “P”. Each bᵢ is a triplet (xₛ, p, xₜ)
with:
xₛ: the initial (start) node
p: the path from the node xₛ to the node xₜ
xₜ: the final (target) node
Figure 7: Example of “B”, the bag of path-contexts for the method f() (source: author)
The model learns several matrices during training. In this section we first explain
path_vocab and value_vocab, then discuss how they are used to construct the context vector.
Path_vocab is an embedding matrix in which each row is the embedding of a specific AST path
(each AST path is mapped to one row of the matrix).
Figure 8: Illustration of the Path_vocab matrix (source: author)
Value_vocab is likewise an embedding matrix, with one row per token in the vocabulary
(each token is also represented as a vector of size d).
Figure 9: Illustration of the Value_vocab matrix (source: author)
It is worth noting that the width of these matrices is denoted by “d” and is generally a number between
100 and 500.
From these two matrices, we construct a context vector for each path-context bᵢ, where i
belongs to the interval [1, n]. For example, if we take b₁ as “(boolean, Primitive ↑ MethodDecl ↓ Name,
?)”, we extract the two rows from Value_vocab that hold the token embeddings of “boolean” and
“?”. Additionally, we extract the row from the Path_vocab matrix that holds the embedding of the
path “Primitive ↑ MethodDecl ↓ Name”. By concatenating these 3 vectors, we obtain a context vector of
size 3d, denoted by c₁.
Example of the context vector for the path-context (boolean, Primitive ↑ MethodDecl ↓ Name, ?)
(source: author)
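The lookup-and-concatenate step can be sketched in a few lines. This is an illustration only: plain Python dictionaries of lists stand in for the learned Value_vocab and Path_vocab matrices, the values are random rather than trained, and d = 4 is an arbitrary toy width.

```python
import random

random.seed(0)  # reproducible toy values; a trained model would learn these
d = 4           # embedding width, kept tiny here (the article cites 100 to 500)

def random_table(keys, width):
    """Map each key to a random row, standing in for a learned embedding matrix."""
    return {key: [random.uniform(-1.0, 1.0) for _ in range(width)] for key in keys}

value_vocab = random_table(["boolean", "?", "Object", "target"], d)
path_vocab = random_table(["Primitive ↑ MethodDecl ↓ Name"], d)

def context_vector(path_context):
    """Concatenate start-token, path, and end-token embeddings into one 3d vector."""
    start, path, end = path_context
    return value_vocab[start] + path_vocab[path] + value_vocab[end]

b1 = ("boolean", "Primitive ↑ MethodDecl ↓ Name", "?")
c1 = context_vector(b1)
print(len(c1))  # 3 * d
```

In a real model the two tables would be trainable embedding layers, and the concatenation would run over a whole batch of path-contexts at once.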
After constructing all of these context vectors, they are passed through a fully connected layer to
compress their width from 3d to d. This compression is achieved by multiplying each context vector
“cᵢ” of size (1, 3d) by a learned weight matrix “W” of size (3d, d) and then applying tanh
element-wise. The result is a (1, d) vector denoted by c̃ᵢ, the combined context vector:
c̃ᵢ = tanh(cᵢ · W)
(The paper writes the same operation in column-vector form, c̃ᵢ = tanh(W · cᵢ), with W of size (d, 3d).)
Illustration of the context vector creation (source: author)
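As a small worked example of this layer, the sketch below multiplies a row vector of size 3d by a matrix of size (3d, d) and applies tanh. The function name combine and the toy numbers are assumptions made for illustration; a real model would use a tensor library and learned weights.

```python
import math

def combine(c, W):
    """Return tanh(c · W) for a row vector c (length 3d) and a matrix W (3d rows, d columns)."""
    rows, cols = len(W), len(W[0])
    assert len(c) == rows, "c must have as many entries as W has rows"
    return [math.tanh(sum(c[k] * W[k][j] for k in range(rows))) for j in range(cols)]

d = 2
c = [1.0, 0.0, 0.0, 0.0, 0.0, 2.0]   # toy context vector of size 3d = 6
W = [[0.5, -0.5],
     [0.0, 0.0],
     [0.0, 0.0],
     [0.0, 0.0],
     [0.0, 0.0],
     [0.25, 0.25]]                    # toy weight matrix of size (3d, d)

c_tilde = combine(c, W)               # c · W = [1.0, 0.0] before tanh
print(c_tilde)
```

Because tanh maps every entry into the open interval (-1, 1), the combined context vectors stay bounded regardless of the scale of the embeddings.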
Now, the question arises: how can we aggregate all of these context vectors into a single fixed-size
vector that captures information about the entire code snippet, rather than just a single path in its
Abstract Syntax Tree (AST)? The answer to this lies in the attention mechanism.
By applying attention, we assign a weight to each combined context vector based on its relevance to the
overall understanding of the code snippet. These weights are calculated from the dot product between a
global attention vector, denoted by “a” in the paper, and each combined context vector “c̃ᵢ”, with i in
the interval [1, n]. The attention vector “a” has size d; it is initialized randomly and learned
together with the rest of the network. The attention weights, denoted by “αᵢ”, are then
applied to the combined context vectors, allowing us to focus more on the most important parts of the
code snippet while downplaying the less significant ones.
Attention (source: author)
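For reference, the attention weight of each combined context vector, as formulated in the code2vec paper, is a softmax over the dot products with the attention vector a:

```latex
\alpha_i = \frac{\exp\left(\tilde{c}_i^{\top} \cdot a\right)}{\sum_{j=1}^{n} \exp\left(\tilde{c}_j^{\top} \cdot a\right)}
```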
The exponentials in the equation ensure that the attention weights are positive.
These weights are then normalized by dividing each of them by their sum, following the standard softmax
function. This normalization ensures that the attention weights sum to 1.
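The computation of the weights can be sketched as follows. The function name attention_weights and the toy vectors are assumptions; subtracting the maximum score before exponentiating is a common numerical-stability trick that leaves the softmax result unchanged and is not something the article describes.

```python
import math

def attention_weights(combined, a):
    """Softmax over the dot products between each combined context vector and a."""
    scores = [sum(x * y for x, y in zip(c, a)) for c in combined]
    top = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]     # always positive
    total = sum(exps)
    return [e / total for e in exps]               # normalized to sum to 1

combined = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # toy combined context vectors
a = [2.0, 0.0]                                     # toy attention vector
alphas = attention_weights(combined, a)
print(alphas)
```

Here the first vector has the largest dot product with a, so it receives the largest weight, while the other two share the same smaller weight.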
The aggregated code vector υ, which represents the entire code snippet, is obtained by taking a linear
combination of the combined context vectors {c̃₁, …, c̃ₙ} weighted by their corresponding attention
weights. In other words, the attention weights are used as factors for the combined context vectors c̃ᵢ.
Code vector calculation (source: https://arxiv.org/abs/1803.09473)
This implies that the attention weights are non-negative, sum to 1, and act as coefficients for
the combined context vectors. Thus, the attention mechanism can be interpreted as a weighted average,
where the weights are learned and computed relative to the other elements in the collection of path-contexts.
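The weighted average described above, υ = Σᵢ αᵢ · c̃ᵢ, can be sketched with toy numbers. The function name code_vector and the example values are hypothetical; the weights are given directly here instead of being computed by the attention step.

```python
def code_vector(combined, weights):
    """Return the sum of the combined context vectors, each scaled by its attention weight."""
    d = len(combined[0])
    return [sum(w * c[j] for w, c in zip(weights, combined)) for j in range(d)]

combined = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy combined context vectors (n = 3, d = 2)
weights = [0.5, 0.25, 0.25]                       # toy attention weights, summing to 1

v = code_vector(combined, weights)
print(v)  # [0.75, 0.5]
```

Whatever the number of path-contexts n, the result always has size d, which is what makes υ a fixed-size representation of the whole snippet.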
4-Conclusion
In conclusion, Code2Vec is a significant step forward in the realm of code representation and
understanding, providing a solid foundation for further advancements in the field. By embracing this
novel approach, we can unlock new possibilities for improving code comprehension, boosting
productivity, and advancing the state of software engineering.