1-Introduction

Have you ever needed to encode a code snippet written in a specific programming language as an embedding vector that captures information about that code? If so, this article is for you. In this article we will walk through the research paper “code2vec: Learning Distributed Representations of Code” and discuss how its authors trained a neural network on syntactic-only data (AST paths) to predict the name of a method given its body. You can test the model via this UI: https://code2vec.org/

Figure 1: Prediction example, where the name of the method f() is predicted as done() (Source: https://code2vec.org/)

2-Abstract syntax trees (AST)

An Abstract Syntax Tree (AST) is a data structure used in computer science to represent the syntactic structure of a program. It is typically generated while parsing a program written in a programming language. For example, if we take the method f() written in Java and parse it with a Java parser, we get as output the abstract syntax tree shown in Figure 3.

Figure 2: Java code snippet (source: https://arxiv.org/abs/1803.09473)

Figure 3: AST of the code shown in Figure 2 (source: https://arxiv.org/abs/1803.09473)

Once the tree is generated, we can extract “AST paths” from it. An “AST path”, also known as an “AST traversal path”, is a sequence of steps that defines a specific route, or sequence of nodes, within an Abstract Syntax Tree.
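The paper works with Java ASTs, but the idea is language-agnostic. As a quick hands-on analogue (an illustrative assumption, not the paper's tooling), Python's built-in `ast` module can parse a small Python translation of a method like f() and expose the same kind of tree:

```python
# Minimal sketch: parse a small function and print its AST, as an analogue of
# the Java parser used in the paper. The Python function here is a stand-in,
# not the paper's actual example code.
import ast

source = "def f(target):\n    return target is None\n"
tree = ast.parse(source)

def dump(node, depth=0):
    """Print each node type with indentation to visualize the tree structure."""
    print("  " * depth + type(node).__name__)
    for child in ast.iter_child_nodes(node):
        dump(child, depth + 1)

dump(tree)
```

Walking the printed tree top-down and bottom-up between two leaves is exactly how the AST paths in the next section are formed.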
From Figure 3, we can extract paths such as:

The path between (boolean, ?): Primitive ↑ MethodDecl ↓ Name

Figure 4: Example 1 (source: author)

The path between (boolean, Object): Primitive ↑ MethodDecl ↓ Parameter ↓ Class

Figure 5: Example 2 (source: author)

The path between (boolean, target): Primitive ↑ MethodDecl ↓ Parameter ↓ VarDeclId

Figure 6: Example 3 (source: author)

3-Code Representation

A code snippet “P” is represented by a bag of path-contexts denoted by “B”, where B = {bᵢ | i ∈ [1, n]} and n is the number of paths extracted from the AST after parsing “P”. Each bᵢ is a triplet (xₛ, p, xₜ) with:

xₛ: the initial (source) node
p: the path from the node xₛ to the node xₜ
xₜ: the final (target) node

Figure 7: Example of “B”, the bag of path-contexts for the method f() (source: author)

The model learns a couple of matrices during its training. In this section we will first explain Path_vocab and Value_vocab, then discuss how they are used to construct the context vector; embeddings are explained clearly in this article.

Path_vocab is an embedding matrix that contains in each row the embedding of a specific AST path (each AST path is mapped to a row in the Path_vocab matrix).

Figure 8: Illustration of the Path_vocab matrix (source: author)

Value_vocab is also a matrix, containing the embedding vectors of all possible tokens in the vocabulary (each token is likewise represented as a vector of size d).

Figure 9: Illustration of the Value_vocab matrix (source: author)

It is worth noting that the width of these matrices, denoted by “d”, is generally a number between 100 and 500. Out of these two matrices, we construct a context vector for each path-context bᵢ, where i belongs to the interval [1, n]. For example, if we take b₁ as “(boolean, Primitive ↑ MethodDecl ↓ Name, ?)”, we extract the two rows from Value_vocab that represent the token embeddings of “boolean” and “?”.
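The lookup-and-concatenate step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vocabulary contents, the choice of d, and the random initialization are all assumptions made for the example.

```python
# Sketch of turning a path-context (x_s, p, x_t) into a 3d context vector by
# looking up rows in the Value_vocab and Path_vocab embedding matrices.
import numpy as np

d = 128  # embedding width; the paper uses values roughly between 100 and 500
rng = np.random.default_rng(0)

# Tiny illustrative vocabularies (real ones hold the whole training corpus).
value_vocab = {"boolean": 0, "?": 1, "Object": 2, "target": 3}
path_vocab = {"Primitive ^ MethodDecl v Name": 0}

# Learned embedding matrices; initialized randomly here for illustration.
value_emb = rng.normal(size=(len(value_vocab), d))  # Value_vocab matrix
path_emb = rng.normal(size=(len(path_vocab), d))    # Path_vocab matrix

def context_vector(x_s, p, x_t):
    """Concatenate token, path, and token embeddings into one 3d vector."""
    return np.concatenate([
        value_emb[value_vocab[x_s]],
        path_emb[path_vocab[p]],
        value_emb[value_vocab[x_t]],
    ])

c1 = context_vector("boolean", "Primitive ^ MethodDecl v Name", "?")
print(c1.shape)  # (384,), i.e. 3 * d
```

The `^`/`v` characters in the path key simply stand in for the ↑/↓ arrows used in the figures.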
Additionally, we extract the row from the Path_vocab matrix that represents the embedding of the path “Primitive ↑ MethodDecl ↓ Name”. By concatenating these 3 vectors, we obtain a context vector of size 3d, denoted by c₁.

Example of an embedding context vector for the path-context (boolean, Primitive ↑ MethodDecl ↓ Name, ?) (source: author)

After constructing all of these context vectors, they are passed through a fully connected layer that compresses their width from 3d to d. This compression is achieved by multiplying a learned weight matrix “W” of size (d, 3d) with each context vector “cᵢ”, viewed as a column vector of size 3d. The computation can be expressed as follows and results in a vector of size d, denoted by c̃ᵢ:

c̃ᵢ = tanh(W · cᵢ)

Illustration of the context vector creation (source: author)

Now, the question arises: how can we aggregate all of these context vectors into a single fixed-size vector that captures information about the entire code snippet, rather than just a single path in its Abstract Syntax Tree (AST)? The answer lies in the attention mechanism. By applying attention, we assign a weight to each context vector based on its relevance to the overall understanding of the code snippet. These weights are calculated by comparing the similarity between a global attention vector, denoted by “a” in the paper, and each combined context vector “c̃ᵢ”, with i in the interval [1, n]. The attention vector “a”, sometimes also called a query vector, is itself a parameter vector learned during training. The attention weights, denoted by “αᵢ” where i belongs to the interval [1, n], are then applied to the context vectors, allowing us to focus more on the most important parts of the code snippet while downplaying the less significant ones.

Attention (source: author)

The exponents in the equations serve the purpose of ensuring that the attention weights are positive.
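The compression, attention weighting, and weighted aggregation into a single code vector can be sketched numerically as follows. This is a hedged illustration: in the real model W and a are learned parameters, whereas here they are random placeholders, and n and d are arbitrary.

```python
# Sketch of code2vec's compression + attention + aggregation steps.
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 128                          # n path-contexts, embedding width d

contexts = rng.normal(size=(n, 3 * d))  # raw context vectors c_i (size 3d)
W = rng.normal(size=(d, 3 * d))         # learned weight matrix, shape (d, 3d)
a = rng.normal(size=d)                  # learned attention (query) vector

# Compress each c_i from 3d to d: combined context vectors c~_i = tanh(W . c_i)
combined = np.tanh(contexts @ W.T)      # shape (n, d)

# Attention weights: softmax over the similarities c~_i . a
scores = combined @ a
alphas = np.exp(scores - scores.max())  # exponentiation keeps weights positive
alphas /= alphas.sum()                  # normalize so the weights sum to 1

# Code vector v: weighted average of the combined context vectors
v = (alphas[:, None] * combined).sum(axis=0)  # shape (d,)
print(v.shape)
```

Note that subtracting `scores.max()` before exponentiating is a standard numerical-stability trick; it leaves the softmax result unchanged.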
These weights are then normalized by dividing each by their sum, following the standard softmax function: αᵢ = exp(c̃ᵢᵀ · a) / Σⱼ₌₁ⁿ exp(c̃ⱼᵀ · a). This normalization ensures that the attention weights sum to 1.

The aggregated code vector υ, which represents the entire code snippet, is obtained by taking a linear combination of the combined context vectors {c̃₁, …, c̃ₙ} weighted by their corresponding attention weights: υ = Σᵢ₌₁ⁿ αᵢ · c̃ᵢ. In other words, the attention weights serve as coefficients for the combined context vectors c̃ᵢ.

Code vector calculation (source: https://arxiv.org/abs/1803.09473)

This implies that the attention weights are non-negative, sum to 1, and act as coefficients for the context vectors. Thus, the attention mechanism can be interpreted as a weighted average, where the weights are learned and computed relative to the other elements in the collection of path-contexts.

4-Conclusion

In conclusion, code2vec is a significant step forward in the realm of code representation and understanding, providing a solid foundation for further advancements in the field. By embracing this novel approach, we can unlock new possibilities for improving code comprehension, boosting productivity, and advancing the state of software engineering.