A Survey of Code Refinement Task

Mark Baushenko
Moscow
m.baushenko@gmail.com

Abstract

In this article, we provide a detailed description of existing methods for solving the Code Refinement problem, analyze the metrics used to evaluate the quality of these methods, and suggest possible ways to improve them.

1 Introduction

In the field of natural language processing, new solutions and upgrades of existing ones appear every day, and it has become difficult for scientists around the world to monitor progress. Therefore, in 2019 the GLUE (General Language Understanding Evaluation) benchmark [1] was created. It covers the main natural language processing tasks and, accordingly, provides a dataset for each task. To put all participants on an equal footing, the organizers introduced a set of rules that must be followed when submitting solutions for evaluation. It soon became obvious that many approaches to natural language processing carry over to programming language processing, because both are built on rules and relationships between elements of a sequence. Following the appearance of the GLUE benchmark, colleagues from Microsoft (https://www.microsoft.com/) created a benchmark for programming language processing, CodeXGLUE [2]. It includes 10 tasks covering different types of input and output sequences: Code-Code, Text-Code, Code-Text, and Text-Text. In this work, we focus on one of these tasks, Code Refinement [3], and compare the existing solutions according to the following criteria: metrics, programming languages used in training, tasks on which pretraining was performed, the architecture and its number of parameters, the number of operations required for computation, and memory usage.

2 Task description

Localization and fixing of bugs is known to be a difficult and time-consuming task for software developers. The purpose of this task is to find an error in the code and fix it automatically. To solve it, architectures of the Encoder-Decoder type are typically used.

2.1 Datasets overview

The benchmark uses the dataset released by Tufano et al. [3], called Bugs2Fix. The source is buggy Java functions, and the target is the corresponding fixed functions. To build this dataset, the authors first downloaded every public GitHub event between March 2011 and October 2017 from GitHub Archive (https://www.gharchive.org/) and used the Google BigQuery APIs to identify all Java-file commits having a message containing the patterns [4]: ("fix" or "solve") and ("bug" or "issue" or "problem" or "error"). For each bug-fixing commit, they extracted the source code before and after the fix using the GitHub Compare API (https://developer.github.com/v3/repos/commits/compare-two-commits) to collect the buggy (pre-commit) and the fixed (post-commit) code. Subsequently, they normalized all the names of variables and custom methods, which greatly limits the vocabulary size and enables the model to focus on learning bug-fixing patterns. Then they filtered out the pairs that contain lexical or syntactic errors in either the buggy or the fixed code, as well as the pairs with more than 100 atomic AST (Abstract Syntax Tree) modification actions between the buggy and the fixed versions. To achieve this, they employed the GumTree Spoon AST Diff tool [5].
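As an illustration of the commit-message filter described above, the following is a minimal Python sketch; the function name and the example messages are our own and only reproduce the ("fix" or "solve") and ("bug" or "issue" or "problem" or "error") pattern, not the full BigQuery pipeline of Bugs2Fix.

```python
import re

# Hypothetical helper mimicking the Bugs2Fix commit-message filter:
# the message must contain ("fix" or "solve") AND ("bug" or "issue" or "problem" or "error").
FIX_WORDS = re.compile(r"\b(fix|solve)", re.IGNORECASE)
BUG_WORDS = re.compile(r"\b(bug|issue|problem|error)", re.IGNORECASE)

def is_bug_fixing_commit(message: str) -> bool:
    return bool(FIX_WORDS.search(message)) and bool(BUG_WORDS.search(message))

if __name__ == "__main__":
    print(is_bug_fixing_commit("Fixes NPE bug in parser"))   # True
    print(is_bug_fixing_commit("Refactor logging module"))   # False
```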
Finally, they divided the whole dataset into two subsets based on code length: small (≤ 50 tokens) and medium (more than 50 and ≤ 100 tokens). For the small subset, the numbers of training, development, and test samples are 46680, 5835, and 5835, respectively. For the medium subset, the numbers are 52364, 6545, and 6545, respectively.

2.2 Metrics overview

In this section we describe the three metrics that are used to evaluate this task and provide the mathematical formulas for their calculation.

2.2.1 BLEU

BLEU [6] is computed from a set of modified n-gram precisions. Specifically,

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right),    (1)

where p_n is the modified precision for n-grams, the base of the logarithm is the natural base e, w_n is a weight between 0 and 1 for \log p_n with \sum_{n=1}^{N} w_n = 1, and BP is the brevity penalty that penalizes short machine translations:

BP = \begin{cases} 1, & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right), & \text{if } c \le r \end{cases}    (2)

where c is the number of unigrams (the length) of all the candidate sentences, and r is the sum of the best match lengths for the candidate sentences in the corpus. Here the best match length is the reference sentence length closest to the candidate sentence length. It is not hard to see that BLEU is always a value between 0 and 1, because BP, w_n and p_n are all between 0 and 1, and

\exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right) = \prod_{n=1}^{N} \exp(w_n \cdot \log p_n) = \prod_{n=1}^{N} \left[\exp(\log p_n)\right]^{w_n} = \prod_{n=1}^{N} p_n^{w_n} \in [0, 1].    (3)

Usually, BLEU uses N = 4 and w_n = 1/N.

2.2.2 Exact Match Accuracy

For each pair p = (precommit, postcommit) in the dataset, Exact Match Accuracy (EMA) is calculated as

EMA = \left( \sum_{n=1}^{N} EM(sample_n) \right) / N,    (4)

where N is the number of pairs in the dataset and EM is calculated as

EM = \begin{cases} 1, & \text{if } LM(precommit) = postcommit \\ 0, & \text{otherwise} \end{cases}    (5)

where LM is the language model.

2.2.3 CodeBLEU

In order to pay attention to keywords, leverage the tree structure, and take semantic logic into account, Ren et al. proposed a new evaluation metric, CodeBLEU [7], defined as the weighted combination of four parts, as shown in Figure 1:

CodeBLEU = \alpha \cdot BLEU + \beta \cdot BLEU_{weight} + \gamma \cdot Match_{ast} + \delta \cdot Match_{df},    (6)

where BLEU is the standard BLEU (Sec. 2.2.1), BLEU_{weight} is the weighted n-gram match, obtained by comparing the hypothesis code and the reference code tokens with different weights, Match_{ast} is the syntactic AST match, exploiting the syntactic information of code, and Match_{df} is the semantic data-flow match, considering the semantic similarity between the hypothesis and the reference. The weighted n-gram match and the syntactic AST match measure grammatical correctness, while the semantic data-flow match measures logical correctness. The authors recommend the combination (0.10, 0.10, 0.40, 0.40) for \alpha, \beta, \gamma, \delta, respectively.

The original BLEU [6] compares n-grams between the candidate and the reference and calculates the ratio of matched n-grams. Compared with natural languages, which have a huge vocabulary and a free word order, programming languages are manually designed and have only a few keywords such as "int", "public" and so on. Applying the traditional BLEU directly to code synthesis ignores the importance of these keywords. Hence, they introduced the weighted n-gram match to assign different weights to different n-grams, so that the keywords receive higher weights, as shown in Figure 1.
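To make the formulas above concrete, here is a minimal, simplified Python sketch of Eqs. (1)-(6). The function names are ours; it combines pre-computed modified n-gram precisions into BLEU with the brevity penalty, computes exact match accuracy, and combines pre-computed CodeBLEU components, rather than re-implementing the full metrics.

```python
import math

def bleu_from_precisions(precisions, c, r):
    """Eqs. (1)-(2): combine modified n-gram precisions p_1..p_N with
    uniform weights w_n = 1/N and apply the brevity penalty BP."""
    n_max = len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0                                  # log(0) undefined; BLEU is 0
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    log_avg = sum((1.0 / n_max) * math.log(p) for p in precisions)
    return bp * math.exp(log_avg)

def exact_match_accuracy(predictions, references):
    """Eqs. (4)-(5): fraction of samples where the model output equals
    the post-commit (fixed) code exactly."""
    matches = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return matches / len(references)

def code_bleu(bleu, weighted_bleu, match_ast, match_df,
              weights=(0.10, 0.10, 0.40, 0.40)):
    """Eq. (6): weighted combination of the four CodeBLEU components."""
    alpha, beta, gamma, delta = weights
    return alpha * bleu + beta * weighted_bleu + gamma * match_ast + delta * match_df

if __name__ == "__main__":
    # Toy numbers, for illustration only.
    print(bleu_from_precisions([0.8, 0.6, 0.5, 0.4], c=95, r=100))
    print(exact_match_accuracy(["int a = 0 ;"], ["int a = 0 ;"]))
    print(code_bleu(0.77, 0.79, 0.85, 0.80))
```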
Figure 1: The proposed CodeBLEU, a weighted syntactic and semantic BLEU for code synthesis evaluation, consists of the original BLEU, the weighted n-gram match, the syntactic AST match, and the semantic data-flow match.

The weighted n-gram match precision is computed as

p_n = \frac{\sum_{C \in Candidates} \sum_{i=1}^{l} \mu_n^i \cdot Count_{clip}(C(i, i+n))}{\sum_{C' \in Candidates} \sum_{i=1}^{l} \mu_n^i \cdot Count(C'(i, i+n))},    (7)

where n is the length of the n-gram, C(i, i+n) is the n-gram from position i to position i + n, and Count_{clip}(C(i, i+n)) is the maximum number of n-grams co-occurring in a candidate code and a set of reference codes. \mu_n^i denotes the weight of the corresponding keyword or n-gram; in [7], \mu_n^i for keywords is 5 times the weight of other tokens. Similar to the original BLEU, the brevity penalty is calculated according to (2). The weighted n-gram match score is then

BLEU_{weight} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right).    (8)

In [7], keywords are only considered in the unigrams, so N and w_n are both equal to 1. Note that a keyword list is predefined for each programming language.

In addition to sequence-level matching, CodeBLEU [7] also takes syntactic information into account by matching tree structures. Unlike natural language, a programming language has a natural tree structure, the abstract syntax tree (AST). An AST is a tree representation of the abstract syntactic structure of a program. All the sub-trees can be obtained from the tree-sitter (https://github.com/tree-sitter/tree-sitter) parsing result, and the accuracy is then calculated by comparing the candidate and reference sub-trees. In an AST, each node denotes a construct occurring in the source code. The leaves of the AST represent the names of the function and of all the variables. However, only the syntactic structure of the code matters here, and the naming does not, so they left out all the leaf nodes of the original AST trees.

As shown in the middle part of Figure 1, they extract all the sub-trees of the candidate and the reference ASTs, respectively. Then they calculate the syntactic AST match score as

Match_{ast} = Count_{clip}(T_{cand}) / Count(T_{ref}),    (9)

where Count(T_{ref}) is the total number of reference sub-trees, and Count_{clip}(T_{cand}) is the number of candidate sub-trees that match the reference. This score evaluates code quality from a syntactic perspective, because grammatical errors such as missing tokens or wrong data types are captured by differences between the ASTs. However, the AST match still ignores the semantics of the code; therefore, semantic information is also considered in CodeBLEU. They use data-flow [8] to represent a source code as a graph, in which nodes represent variables and edges represent where the value of each variable comes from. Based on the above, there are three steps to compute the semantic data-flow match score.

Step 1: Obtain the data-flow graphs for the candidate and the reference. Based on the AST, they first use the leaves to identify the variable sequence, denoted as V = {v_0, v_1, ..., v_m}. They then take each variable as a node of the graph, and a directed edge \epsilon = \langle v_i, v_j \rangle from v_i to v_j indicates that the value of the j-th variable comes from the i-th variable. The graph G(C) = (V, E) is used to represent relations among the variables of the code C, as shown by the red arrows in Figure 1.

Step 2: Normalize data-flow items. For simplicity and unity, they ignore the variable positions and normalize the variable names:
they collect all the variables in the data-flow items and rename them var_i, where i is the order in which the variables appear across all data-flow items.

Step 3: Calculate the semantic data-flow match score as

Match_{df} = Count_{clip}(DF_{cand}) / Count(DF_{ref}),    (10)

where Count(DF_{ref}) is the total number of reference data-flows, and Count_{clip}(DF_{cand}) is the number of candidate data-flows that match the reference.

3 Existing methods

For convenience in tracking technical progress, in this section we briefly review the solutions in the chronological order in which they appeared in the CodeXGLUE benchmark.

3.1 Benchmark's baseline models

The authors of the benchmark [2] provided four baseline models against which participants can compare their results. The Naive method simply copies the buggy code as the repair result. For the Transformer [9], they used the same number of layers and hidden size as the pretrained models. They also used a vanilla LSTM [10] to solve this problem. For the CodeBERT [11] method, they initialized the encoder with CodeBERT and used a randomly initialized Transformer with 6 layers, 768-dimensional hidden states, and 12 attention heads as the decoder, see Figure 2. CodeBERT is a bimodal pretrained model based on the Transformer with 12 layers, 768-dimensional hidden states, and 12 attention heads for programming language (PL) and natural language (NL). Feng et al. [11] pretrain CodeBERT with the masked language modeling and replaced token detection objectives on the CodeSearchNet dataset [12], which includes 2.4M functions paired with documentation for six programming languages. The model supports different types of sequence input, such as text/code and code/code, with a special token [CLS] in front of the sequence and a special symbol [SEP] to separate the two segments. They then used the training data to fine-tune the whole model.

Figure 2: Pipeline for the Encoder-Decoder framework.

3.2 PLBART model

PLBART [13] uses the same architecture as BART_base [14]: a sequence-to-sequence Transformer [9] with 6 encoder layers and 6 decoder layers, a model dimension of 768, and 12 attention heads (about 140M parameters). The only exception is an additional layer-normalization layer on top of both the encoder and the decoder, following [15], which was found to stabilize training with FP16 precision.

PLBART is trained as a denoising autoencoder: the model learns to reconstruct an input text that has been corrupted by a noise function, and reconstructing the original input forces the model to learn the syntax and semantics of the language. They used three noising strategies: token masking, token deletion, and token infilling [14]. In the first two strategies, random tokens are sampled and replaced with a mask token or deleted from the input sequence. In token infilling, a number of text spans are sampled and each span is replaced with a single mask token. The span lengths are drawn from a Poisson distribution (\lambda = 3.5). They masked 35% of the tokens in each instance.

3.3 CoTexT model

CoTexT [16] follows the sequence-to-sequence encoder-decoder architecture proposed by [9]. They initialized from the Base T5 model released by [17], which has 220 million parameters, and trained the model with a learning rate of 0.001 and an input/target length of 1024. The model is trained with the maximum likelihood objective (that is, using "teacher forcing" [18]) regardless of whether the task is text-to-code or code-to-text.
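Since both CoTexT and (as described below) NSEdit rely on teacher forcing, the following is a minimal sketch of what a teacher-forced maximum-likelihood step looks like; `decoder_step` and the toy vocabulary are our own placeholders, not part of any of the surveyed models.

```python
import math

def teacher_forced_nll(decoder_step, source_tokens, target_tokens):
    """One maximum-likelihood (teacher forcing) step: at every position the
    ground-truth previous target token is fed to the decoder, not the token
    the model itself predicted. Returns the average negative log-likelihood."""
    nll = 0.0
    prev = "<s>"                                        # start-of-sequence token
    for gold in target_tokens:
        probs = decoder_step(source_tokens, prev)       # dict: token -> probability
        nll -= math.log(probs.get(gold, 1e-12))
        prev = gold                                     # teacher forcing step
    return nll / len(target_tokens)

if __name__ == "__main__":
    # Toy decoder that always predicts a uniform distribution over a tiny vocabulary.
    vocab = ["<s>", "</s>", "int", "a", "=", "0", "1", ";"]
    uniform = lambda src, prev: {tok: 1.0 / len(vocab) for tok in vocab}
    print(teacher_forced_nll(uniform, ["int", "a", "=", "1", ";"],
                             ["int", "a", "=", "0", ";"]))
```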
Because the same objective is used for both directions, CoTexT leverages multi-task learning [17] to perform both text-to-code and code-to-text generation on the Code Summarization and Code Refinement tasks. To specify which task the model should perform, they simply add a task-specific prefix to the input sequence. For example, when fine-tuning on the Code Summarization task, they prepend a prefix with the name of the programming language (e.g., Java) to the input sequence.

3.4 NSEdit model

NSEdit [19] uses the Transformer [9] to perform sequence-to-sequence prediction. The model computes f(x) = \hat{e}, where x is the buggy token sequence and \hat{e} is the predicted editing sequence. The encoder processes the buggy code x and outputs the encoder memory m, as shown in Equation (11). For an input x with L tokens and a model with h hidden units, the encoder memory has shape (L, h), omitting the batch dimension. The decoder takes m and the current editing-sequence token e_i as input and autoregressively predicts the next token e_{i+1} by maximum likelihood, as shown in Equations (12) and (13), where [\cdot] denotes the slicing operator in Python:

m = encoder(x),    (11)

\hat{E}_{i+1} = decoder(m, e_i),    (12)

\hat{e}_{i+1} = \arg\max_{w \in W} \hat{E}_{i+1}[w].    (13)

They used teacher forcing as the training procedure [18, 20]: in Equation (12), the ground-truth edit token e_i is fed into the decoder rather than the predicted token \hat{e}_i. They fine-tuned the pretrained CodeBERT [11] and CodeGPT [2] on the Bugs2Fix dataset [3]. They modified the decoder to have two modes: a word/action mode that predicts edit actions and inserted words, and a location mode that predicts edit locations. The original CodeBERT tokenizer has 50,265 word tokens in its vocabulary, and they add [DELETE] and [INSERT] tokens to it. When predicting words or actions, the decoder outputs a probability vector \hat{W} over the set W of 50,267 elements by passing the output logits c through a softmax function, as shown in Equation (14):

\hat{W} = \frac{\exp(c)}{\sum_{j \in W} \exp(c_j)}.    (14)

When predicting locations, instead of further expanding the vocabulary with 513 location tokens and predicting them along with words and actions, the decoder uses a pointer network in place of its last layer (Figure 3). The pointer network is a feed-forward neural network: it transforms the output of the penultimate layer of the decoder into a latent representation v [21, 22]. To determine the location of the edit, the dot product between v and m is computed and passed through a softmax over all edit locations, as shown in Equation (15). As a result, the pointer network outputs a probability vector \hat{L} over the edit locations 0, 1, 2, ..., L of a buggy code with L tokens:

\hat{L} = \frac{\exp(v^T m)}{\sum_{j=0}^{L} \exp(v^T m_j)}.    (15)

Since the ground truth is available with teacher forcing, the decoder mode is chosen according to the type of the ground-truth token e_{i+1}.

Figure 3: An illustration of the main NSEdit model architecture. There are two modes in the decoder. One mode predicts words and actions, and the other mode selects locations with a pointer network. The pointer network takes the penultimate-layer output of the decoder and compares it with the encoder memory via dot products in order to select the edit location.

They slice the encoder memory and use m[l] as the embedding of a location token [LOC_l] at the input of the decoder in Equation (12). As a result, both the input m[l] and the output v of the decoder for locations are content-based representations, rather than fixed location embeddings that do not change when the location context changes with the input program.
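As an illustration of Equations (14) and (15), here is a minimal sketch of the two softmax computations in plain Python with toy dimensions; the helper names are ours, and this is not the NSEdit implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def word_action_distribution(logits):
    """Eq. (14): probability vector over the word/action vocabulary."""
    return softmax(logits)

def location_distribution(v, memory):
    """Eq. (15): pointer-network distribution over edit locations.
    v is the decoder's latent vector; memory is the list of encoder
    memory vectors m_0, ..., m_L; the score of location j is v . m_j."""
    scores = [sum(vi * mi for vi, mi in zip(v, m_j)) for m_j in memory]
    return softmax(scores)

if __name__ == "__main__":
    print(word_action_distribution([2.0, 0.5, -1.0]))
    v = [0.1, 0.3, -0.2]
    memory = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
    print(location_distribution(v, memory))
```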
They used a cross-entropy loss for both word/action prediction and location prediction and added the two losses with equal coefficients. They also used beam search [23, 24, 25] to generate the top-5 editing sequences (hypotheses). They trained two rerankers with different architectures to classify which editing sequence is correct among the beam-search hypotheses. The two rerankers and the original beam-search score are combined in an ensemble to produce the final ranking. Lastly, they fine-tuned the rerankers on the beam-search hypotheses of the validation set to reduce over-fitting. The beam-search hypotheses reranked by the fine-tuned ensemble are the final predictions. Figure 4 shows the finite state machine of the edit grammar whose rules the NSEdit model learns to predict.

4 Comparison of models

In this section, we compare the models according to several criteria.

4.1 Programming languages

The models presented in the benchmark were trained on data covering various programming languages: Java, Python, JavaScript, PHP, Ruby, and Go. The distribution by model can be found in Table 1; methods that have no trainable parameters are marked "None". As the table shows, all the models include the Java programming language, which is explained by the fact that the evaluation dataset consists of a single programming language, Java.

Figure 4: The transition diagram of the finite state machine for the NSEdit grammar used to generate editing sequences. The start state is BOS and the accept state is EOS.

Model         Programming languages
NSEdit        Java
CoTexT        Java, Python, JavaScript, PHP, Ruby, Go
CodeBERT      Java
PLBART        Java, Python
Transformer   Java
LSTM          Java
Naive copy    None

Table 1: Programming languages.

4.2 Pretrained tasks

In this section, we describe the tasks on which the encoder and decoder architectures presented in the benchmark were pretrained, see Table 2. In some cases it is not possible to separate the encoder and the decoder as distinct architectures; for those models the corresponding cells are merged. Pretraining tasks can only be given for Transformer-type architectures; for the other methods the pretraining tasks are marked "None".

Model         Encoder pretraining                                   Decoder pretraining
NSEdit        Masked Language Modeling, Replaced Token Detection    Next Token Prediction
CoTexT        Masked Language Modeling (encoder and decoder jointly)
CodeBERT      Masked Language Modeling, Replaced Token Detection    w/o pretraining
PLBART        Token masking, Token deletion, Token infilling (encoder and decoder jointly)
Transformer   w/o pretraining
LSTM          None
Naive copy    None

Table 2: Pretrained tasks.

4.3 Parameters

In this section, we give a detailed description of the Transformer architectures present in the benchmark: number of layers, maximum position length, embedding size, number of attention heads, attention head size, vocabulary size, and total number of parameters. As Table 3 shows, almost all the models are of a comparable scale.

                             CodeBERT   PLBART   CoTexT   NSEdit   Transformer
Number of layers             12         6        12       24       12
Max length of position       512        1024     1024     512      512
Embedding size               768        768      768      768      768
Attention heads              12         12       12       12       12
Attention head size          64         64       64       64       64
Vocabulary size              50,265     50,004   –        50,267   32,000
Total number of parameters   125M       140M     220M     249M     220M

Table 3: Parameters.
4.4 Metrics

The benchmark gives participants the opportunity to submit their results, according to certain rules, to a verification system, so that the results of different models can be compared objectively. As mentioned earlier, this task is evaluated with three quality metrics: BLEU, Exact Match Accuracy and CodeBLEU (see Sec. 2.2.1, Sec. 2.2.2 and Sec. 2.2.3). Table 4 shows the results of the benchmark evaluation.

                                                             Small test set               Medium test set
Model         Organization                      Date         BLEU    Acc(%)  CodeBLEU     BLEU    Acc(%)  CodeBLEU
NSEdit        NSEdit Team                       2021-11-18   85.72   24.04   /            71.06   13.87   /
CoTexT        Case Western Reserve University   2021-04-23   77.91   22.64   76.43        88.31   15.36   84.53
CodeBERT      UCLA & Columbia University        2020-08-30   77.42   16.40   75.58        91.07   5.16    87.52
PLBART        CodeXGLUE Team                    2021-04-02   77.02   19.21   /            88.50   8.98    /
Transformer   CodeXGLUE Team                    2020-08-30   77.21   14.70   73.31        89.25   3.70    81.72
LSTM          CodeXGLUE Team                    2020-08-30   76.76   10.00   /            72.08   2.50    /
Naive copy    CodeXGLUE Team                    2020-08-30   78.06   0       /            90.91   0       /

Table 4: Results on the code repair task.

4.5 Complexity

The benchmark contains only recurrent and Transformer models, whose main building blocks are the recurrent cell and self-attention, respectively. We therefore think it makes sense to compare the complexity of these building blocks rather than of each model individually. Table 5 compares them in terms of complexity per layer, sequential operations, and maximum path length.

Layer Type       Complexity per Layer   Sequential Operations   Maximum Path Length
Recurrent        O(T · D^2)             O(T)                    O(T)
Self-Attention   O(T^2 · D)             O(1)                    O(1)

Table 5: Per-layer complexity, minimum number of sequential operations and maximum path length for different layer types. T is the sequence length, D is the representation dimension.

5 Discussion

We were very inspired by the most recent development, NSEdit [19]. The clear formulation of the hypothesis that Transformers tend to memorize specific data (programming language text), and the original approach to testing it, namely developing a regular edit language and using it as the prediction target for the Transformer, led to the best results in the benchmark and generated a number of directions for future work. After reviewing all of this research, using a Transformer to predict a regular edit language seems to be the most promising direction for the code refinement task.

At the moment, however, the results are still far from ideal. It is well known that the quality of a Transformer model depends strongly on its scale. Companies such as OpenAI (https://openai.com/), Google (https://www.google.com/), Microsoft (https://www.microsoft.com/), and Meta (https://www.meta.com/) often release pretrained models in several configurations: small, base, large, xlarge, etc.; the largest of these have several billion parameters and perform better in evaluations. As an improvement to the current result, we propose to increase the number of trainable parameters and layers of the neural network. Scaling up, however, may run into a shortage of training data, so we also propose to develop a converter that translates code from a specific programming language into pseudo-code, and to train the Transformer model on pairs (buggy pseudo-code, edits that fix the bug). This approach would increase the training set several-fold, because any programming language could then be used for training. As an example, we give a possible inference pipeline for such an architecture, see Figure 5.

Figure 5: Possible inference pipeline to improve results.
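To make the proposed pipeline in Figure 5 more concrete, here is a schematic sketch of what its inference path could look like. Every function passed in below (to_pseudo_code, predict_edits, apply_edits, from_pseudo_code) is a hypothetical placeholder for a component that would still have to be built; this is a sketch of the idea, not an existing implementation.

```python
from typing import Callable, List

def refine(code: str,
           language: str,
           to_pseudo_code: Callable[[str, str], str],
           predict_edits: Callable[[str], List[str]],
           apply_edits: Callable[[str, List[str]], str],
           from_pseudo_code: Callable[[str, str], str]) -> str:
    """Schematic inference pipeline for the proposed approach:
    1) convert the buggy code of a given language into pseudo-code,
    2) let a Transformer predict an editing sequence over the pseudo-code,
    3) apply the edits,
    4) convert the fixed pseudo-code back into the original language."""
    pseudo = to_pseudo_code(code, language)           # language -> pseudo-code
    edits = predict_edits(pseudo)                     # Transformer trained on (buggy pseudo-code, edits)
    fixed_pseudo = apply_edits(pseudo, edits)         # deterministic edit application
    return from_pseudo_code(fixed_pseudo, language)   # pseudo-code -> language
```

The key design choice in this sketch is that only predict_edits would need to be learned; the converters and the edit applier could be rule-based, which is what would allow training data from any programming language to be pooled.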
6 Conclusion

In this paper, we described the Code Refinement problem, gave a detailed description of the dataset used to evaluate models, and analyzed which metrics are used for evaluation and how they are calculated. We also gave brief descriptions of the existing methods and compared them from several points of view. At the end of the work, we discussed how the current results might be improved.

References

[1] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.

[2] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021.

[3] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions on Software Engineering and Methodology (TOSEM), 28(4):1–29, 2019.

[4] Michael Fischer, Martin Pinzger, and Harald Gall. Populating a release history database from version control and bug tracking systems. In International Conference on Software Maintenance (ICSM 2003), pages 23–32. IEEE, 2003.

[5] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. Fine-grained and accurate source code differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated Software Engineering, pages 313–324, 2014.

[6] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[7] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. CodeBLEU: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297, 2020.

[8] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. GraphCodeBERT: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366, 2020.

[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[11] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. CodeBERT: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020.

[12] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.

[13] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333, 2021.
[14] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

[15] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742, 2020.

[16] Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. CoTexT: Multi-task learning with code-text transformer. arXiv preprint arXiv:2105.08645, 2021.

[17] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[18] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989.

[19] Yaojie Hu, Xingjian Shi, Qiang Zhou, and Lee Pike. Fix bugs with transformer through a neural-symbolic edit grammar. arXiv preprint arXiv:2204.06643, 2022.

[20] Alex M. Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. Advances in Neural Information Processing Systems, 29, 2016.

[21] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. Advances in Neural Information Processing Systems, 28, 2015.

[22] Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. Neural program repair by jointly learning to localize and repair. arXiv preprint arXiv:1904.01720, 2019.

[23] D. Raj Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 17:138, 1977.

[24] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

[25] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27, 2014.