A SURVEY OF CODE REFINEMENT TASK
Mark Baushenko
Moscow
m.baushenko@gmail.com
ABSTRACT
In this article, we provide a detailed description of existing methods for solving the Code Refinement task, analyze the existing metrics for evaluating the quality of these methods, and suggest possible ways to improve them.
1 Introduction
In the field of natural language processing, new solutions and upgrades of existing ones appear every day, and it has become difficult for researchers around the world to track progress. For this reason, the GLUE (General Language Understanding Evaluation) [1] benchmark was created in 2019. It covers the main natural language processing tasks and provides a dataset for each of them. To put all participants of the benchmark in the same conditions, the organizers introduced a set of rules under which solutions must be submitted for evaluation. It became obvious that many approaches to natural language processing can be carried over to programming language processing tasks, because both are built on certain rules and relationships. Following the appearance of the GLUE benchmark, colleagues from Microsoft^1 created a benchmark for programming language processing, CodeXGLUE [2]. It includes 10 tasks covering different types of input and output sequences: Code-Code, Text-Code, Code-Text, and Text-Text. In this work, we focus on one of these tasks, Code Refinement [3], and compare the existing solutions according to the following criteria: metrics, programming languages used in training, tasks on which pretraining was performed, the content of the architecture and the number of parameters in it, the number of operations required for computation, and memory usage.
2 Task description
Bug localization and fixing are known to be difficult and time-consuming tasks for software developers. The purpose of this task is to find an error in the code and fix it automatically. To solve it, architectures of the Encoder-Decoder type are proposed.
2.1 Datasets overview
The benchmark uses the dataset released by Tufano et al. [3], which is called Bugs2Fix. The source is buggy Java functions, whereas the target is the corresponding fixed functions. To build this dataset, the authors first download every public GitHub event between March 2011 and October 2017 from GitHub Archive^2 and use the Google BigQuery APIs to identify all Java-file commits having a message containing the patterns [4]: (“fix” or “solve”) and (“bug” or “issue” or “problem” or “error”). For each bug-fixing commit, they extract the source code before and after the fix by using the GitHub Compare API^3 to collect the buggy (pre-commit) and the fixed (post-commit) code. Subsequently, they normalize all the names of variables and custom methods, which greatly limits the vocabulary size and enables the model to focus on learning bug-fixing patterns. Then, they filter out the pairs that contain lexical or syntactic errors in either the buggy or the fixed code, as well as the pairs with more than 100 atomic AST (Abstract Syntax Tree) modification actions between the buggy and the fixed versions. To achieve this, they employ the GumTree Spoon AST Diff tool [5].
^1 https://www.microsoft.com/
^2 https://www.gharchive.org/
^3 https://developer.github.com/v3/repos/commits/compare-two-commits
Finally, they divide the whole dataset into two subsets based on code length: small, with at most 50 tokens, and medium, with more than 50 and at most 100 tokens. For the small subset, the numbers of training, development, and test samples are 46,680, 5,835, and 5,835, respectively. For the medium subset, the numbers are 52,364, 6,545, and 6,545, respectively.
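To make the construction procedure concrete, the sketch below reproduces the two filtering ideas described above in plain Python: selecting bug-fixing commits by the message patterns from [4] and splitting function pairs into the small and medium subsets by token count. The helper names and the whitespace tokenizer are our own simplifications and are not part of the original Bugs2Fix tooling.

def is_bug_fixing_commit(message: str) -> bool:
    """Commit-message filter: ("fix" or "solve") and ("bug"/"issue"/"problem"/"error")."""
    msg = message.lower()
    return (("fix" in msg or "solve" in msg)
            and any(word in msg for word in ("bug", "issue", "problem", "error")))

def split_by_length(pairs, tokenize=str.split):
    """Split (buggy, fixed) pairs into the small (<= 50 tokens) and
    medium (> 50 and <= 100 tokens) subsets; longer pairs are dropped."""
    small, medium = [], []
    for buggy, fixed in pairs:
        n_tokens = max(len(tokenize(buggy)), len(tokenize(fixed)))
        if n_tokens <= 50:
            small.append((buggy, fixed))
        elif n_tokens <= 100:
            medium.append((buggy, fixed))
    return small, medium

print(is_bug_fixing_commit("Fix NPE: solve the crash issue in parser"))  # True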
2.2 Metrics overview
In this section, we describe the three metrics that are used to evaluate this task and provide mathematical formulas for their calculation.
2.2.1 BLEU
BLEU [6] is computed from modified n-gram precisions. Specifically,

BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right),    (1)

where p_n is the modified precision for n-grams, the base of the logarithm is the natural base e, w_n is a weight between 0 and 1 for \log p_n with \sum_{n=1}^{N} w_n = 1, and BP is the brevity penalty that penalizes short machine translations.
BP = \begin{cases} 1, & \text{if } c > r \\ \exp(1 - r/c), & \text{if } c \le r \end{cases}    (2)

where c is the number of unigrams (the length) in all the candidate sentences, and r is the sum of the best match lengths for the candidate sentences in the corpus. Here the best match length is the reference sentence length closest to the candidate sentence.
It is not hard to see that BLEU is always a value between 0 and 1, because BP, w_n and p_n are always between 0 and 1, and

\exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right) = \prod_{n=1}^{N} \exp(w_n \cdot \log p_n) = \prod_{n=1}^{N} \left[ \exp(\log p_n) \right]^{w_n} = \prod_{n=1}^{N} p_n^{w_n} \in [0, 1].    (3)

Usually, BLEU uses N = 4 and w_n = \frac{1}{N}.
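The following self-contained sketch computes corpus-level BLEU as in Equations (1)-(3), with clipped n-gram counts, uniform weights w_n = 1/N, and the brevity penalty of Equation (2). It assumes one reference per candidate and uses a crude floor instead of proper smoothing, so it is meant for illustration rather than as a drop-in replacement for a standard implementation.

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, N=4):
    """Corpus BLEU with uniform weights w_n = 1/N (Equations (1)-(3))."""
    log_sum = 0.0
    for n in range(1, N + 1):
        matched, total = 0, 0
        for cand, ref in zip(candidates, references):
            cand_counts = Counter(ngrams(cand, n))
            ref_counts = Counter(ngrams(ref, n))
            # Clipped counts: a candidate n-gram is matched at most as many
            # times as it occurs in the reference.
            matched += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            total += sum(cand_counts.values())
        p_n = matched / total if total else 0.0
        log_sum += (1.0 / N) * math.log(p_n if p_n > 0 else 1e-9)
    # Brevity penalty, Equation (2).
    c_len = sum(len(c) for c in candidates)
    r_len = sum(len(r) for r in references)
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)
    return bp * math.exp(log_sum)

hyp = ["public int add ( int a , int b ) { return a + b ; }".split()]
ref = ["public int add ( int x , int y ) { return x + y ; }".split()]
print(round(corpus_bleu(hyp, ref), 4))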
2.2.2 Exact Match Accuracy
For each pair p = (precommit, postcommit) in the dataset, Exact Match Accuracy (EMA) is calculated as:
EMA = \left( \sum_{n=1}^{N} EM(\text{sample}_n) \right) / N,    (4)
where N is the number of pairs in the dataset and EM is calculated as:
EM = \begin{cases} 1, & \text{if } LM(\text{precommit}) = \text{postcommit} \\ 0, & \text{otherwise} \end{cases}    (5)
where LM is the Language Model.
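A direct implementation of Equations (4) and (5) is only a few lines; the sketch below assumes the model outputs and the post-commit targets are given as strings normalized in the same way.

def exact_match_accuracy(predictions, targets):
    """EMA from Equations (4)-(5): the fraction of predictions that are
    identical to the fixed (post-commit) code."""
    assert len(predictions) == len(targets)
    exact = sum(1 for pred, ref in zip(predictions, targets) if pred == ref)
    return exact / len(targets)

print(exact_match_accuracy(["return a + b ;", "return x ;"],
                           ["return a + b ;", "return x - 1 ;"]))  # 0.5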
2.2.3 CodeBLEU
In order to pay attention to keywords, leverage the tree structure, and consider semantic logic information, the authors of [7] proposed a new evaluation metric, CodeBLEU, defined as the weighted combination of four parts, as shown in Figure 1:
CodeBLEU = \alpha \cdot BLEU + \beta \cdot BLEU_{weight} + \gamma \cdot Match_{ast} + \delta \cdot Match_{df},    (6)
where BLEU is the standard BLEU score (Sec. 2.2.1), BLEU_{weight} is the weighted n-gram match, obtained by comparing the hypothesis code and the reference code tokens with different weights, Match_{ast} is the syntactic AST match, exploring the syntactic information of the code, and Match_{df} is the semantic data-flow match, considering the semantic similarity between the hypothesis and the reference. The weighted n-gram match and the syntactic AST match are used to measure grammatical correctness, and the semantic data-flow match is used to calculate logic correctness.
The authors of the article recommend using the combination (0.10, 0.10, 0.40, 0.40) for α, β, γ, δ, respectively.
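Given the four component scores, the final value of Equation (6) is just a weighted sum. The sketch below uses the (0.10, 0.10, 0.40, 0.40) weights recommended by the authors and assumes the components have already been computed on a 0-1 scale; the example numbers are made up.

def codebleu(bleu, weighted_bleu, ast_match, dataflow_match,
             weights=(0.10, 0.10, 0.40, 0.40)):
    """CodeBLEU as the weighted combination of Equation (6)."""
    alpha, beta, gamma, delta = weights
    return (alpha * bleu + beta * weighted_bleu
            + gamma * ast_match + delta * dataflow_match)

# Example with made-up component scores.
print(codebleu(0.77, 0.80, 0.65, 0.60))  # 0.657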
The original BLEU [6] compares n-grams between the candidate and the reference, and calculates the ratio of matched n-grams. Compared with natural languages, which have a huge vocabulary and a free word order, programming languages are manually designed and have only a few keywords such as “int”, “public”, and so on. Applying the traditional BLEU directly to code synthesis would ignore the importance of these keywords. Hence, they introduced the weighted n-gram match to assign different weights to different n-grams, so that the keywords may have higher weights, as shown in Figure 1.
Figure 1: The proposed CodeBLEU, a weighted syntactic and semantic BLEU for code synthesis evaluation, consists of
the original BLEU, the weighted ngram match, the syntactic AST match, and the semantic data-flow match.
The weighted n-gram match precision is computed as:

p_n = \frac{\sum_{C \in \text{Candidates}} \sum_{i=1}^{l} \mu_n^i \cdot Count_{clip}(C(i, i+n))}{\sum_{C' \in \text{Candidates}} \sum_{i=1}^{l} \mu_n^i \cdot Count(C'(i, i+n))},    (7)

where n means the length of the n-gram, C(i, i+n) is the n-gram from position i to position i+n, and Count_{clip}(C(i, i+n)) is the maximum number of n-grams co-occurring in a candidate code and a set of reference codes. \mu_n^i denotes the weight of different keywords or n-grams. In [7], \mu_n^i for the keywords is 5 times the weight of other tokens. Similar to the original BLEU, they calculate the brevity penalty according to (2).
The weighted n-gram match score is calculated as:

BLEU_{weight} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \cdot \log p_n \right).    (8)

In [7], keywords are only considered in the unigrams, so N and w_n are equal to 1. Note that a keyword list is predefined for each programming language.
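The sketch below implements the weighted unigram case of Equation (7) used in [7]: keyword occurrences receive 5 times the weight of ordinary tokens in the clipped matches, while the denominator uses unclipped counts. The short keyword list is a truncated illustration, not the full predefined Java keyword list.

from collections import Counter

# A few Java keywords for illustration; the real metric uses a full
# predefined keyword list for each programming language.
JAVA_KEYWORDS = {"public", "private", "static", "int", "return", "if", "else", "void"}

def weighted_unigram_precision(candidate, reference, keyword_weight=5.0):
    """Weighted unigram match precision (Equation (7) with n = 1)."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    weight = lambda tok: keyword_weight if tok in JAVA_KEYWORDS else 1.0
    numerator = sum(weight(t) * min(c, ref_counts[t]) for t, c in cand_counts.items())
    denominator = sum(weight(t) * c for t, c in cand_counts.items())
    return numerator / denominator if denominator else 0.0

cand = "public int add ( int a , int b ) { return a + b ; }".split()
ref  = "private int add ( int a , int b ) { return a + b ; }".split()
# The mismatched keyword "public" costs 5 times more than an ordinary token would.
print(round(weighted_unigram_precision(cand, ref), 3))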
In addition to the sequence-level matching, CodeBLEU [7] also considers syntactic information by matching the tree structure. Different from natural language, a programming language has a natural tree structure, such as the abstract syntax tree (AST). An AST is a tree representation of the abstract syntactic structure of a programming language. All the sub-trees of the tree-sitter parsing result^4 can be obtained, and the accuracy is then calculated by comparing the candidate and reference sub-trees. In an AST, each node denotes a construct occurring in the source code. The leaves of the AST represent the names of the function and all the variables. However, only the syntactic structure of the code is of interest here, and the naming is not important, so they left out all the leaf nodes of the original AST trees.
^4 https://github.com/tree-sitter/tree-sitter
As shown in the middle part of Figure 1, they extract all the sub-trees of the candidate and the reference ASTs, respectively. Then they calculate the syntactic AST match score as:

Match_{ast} = Count_{clip}(T_{cand}) / Count(T_{ref}),    (9)

where Count(T_{ref}) is the total number of reference subtrees, and Count_{clip}(T_{cand}) is the number of candidate subtrees that match the reference. This score can evaluate code quality from a syntactic perspective, because grammatical errors such as missing tokens or data type errors can be captured by the difference between the ASTs.
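To illustrate Match_ast without depending on a particular parser, the sketch below represents an AST node as a (type, children) pair, ignores leaf names, summarizes each internal node by its type and the types of its children, and computes the clipped match of Equation (9). In practice the trees would come from a parser such as tree-sitter; the toy trees here are hand-written and purely illustrative.

from collections import Counter

def subtrees(node):
    """Summarize every internal node as (type, child types); leaf names are ignored.
    This is a simplification of comparing full sub-trees."""
    node_type, children = node
    if not children:  # leaf
        return
    yield (node_type, tuple(child[0] for child in children))
    for child in children:
        yield from subtrees(child)

def ast_match(candidate_root, reference_root):
    """Match_ast = Count_clip(T_cand) / Count(T_ref), Equation (9)."""
    cand = Counter(subtrees(candidate_root))
    ref = Counter(subtrees(reference_root))
    matched = sum(min(c, ref[s]) for s, c in cand.items())
    return matched / max(sum(ref.values()), 1)

# Toy trees for "return a + b;" (reference) vs "return a - b;" (candidate).
ref_tree  = ("return_stmt", [("binary_expr", [("identifier", []), ("+", []), ("identifier", [])])])
cand_tree = ("return_stmt", [("binary_expr", [("identifier", []), ("-", []), ("identifier", [])])])
print(ast_match(cand_tree, ref_tree))  # 0.5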
CodeBLEU also considers semantic information. They use data-flow [8] to represent a source code as a graph, in which nodes represent variables and edges represent where the value of each variable comes from. Based on this, there are three steps to compute the semantic data-flow match score.
Step 1: Obtain the data-flow graphs for the candidate and the reference. Based on the AST, they first utilize the leaves to identify the variable sequence, denoted as V = {v_0, v_1, ..., v_m}. They then take each variable as a node of the graph, and a directed edge \epsilon = \langle v_i, v_j \rangle from v_i to v_j denotes that the value of the j-th variable comes from the i-th variable. The graph G(C) = (V, E) is used to represent relations among the variables of the code C, as shown by the red arrows in Figure 1.
Step 2: Normalize data-flow items. For simplicity and unity, they ignore the variable positions and normalize their names: they collect all the variables in the data-flow items and rename them var_i, where i is the order in which the variables appear in all data-flow items.
Step 3: Calculate the semantic data-flow match score as:
Match_{df} = Count_{clip}(DF_{cand}) / Count(DF_{ref}),    (10)
where Count(DF_{ref}) is the total number of reference data-flows, and Count_{clip}(DF_{cand}) is the number of matched candidate data-flows.
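The data-flow match can be illustrated on already-extracted edge lists: variables are renamed by order of first appearance (Step 2) and the clipped overlap of Equation (10) is computed (Step 3). Extracting the edges from the AST (Step 1) is parser-specific and is assumed to have been done; the edge lists below are hand-written examples.

from collections import Counter

def normalize_dataflow(edges):
    """Rename variables var_0, var_1, ... by order of first appearance (Step 2).
    `edges` is a list of (source_var, target_var) pairs."""
    names = {}
    def rename(v):
        if v not in names:
            names[v] = f"var_{len(names)}"
        return names[v]
    return [(rename(src), rename(dst)) for src, dst in edges]

def dataflow_match(candidate_edges, reference_edges):
    """Match_df = Count_clip(DF_cand) / Count(DF_ref), Equation (10)."""
    cand = Counter(normalize_dataflow(candidate_edges))
    ref = Counter(normalize_dataflow(reference_edges))
    matched = sum(min(c, ref[e]) for e, c in cand.items())
    return matched / max(sum(ref.values()), 1)

# x = a; y = x + b;  vs  x = a; y = x - c;
reference = [("a", "x"), ("x", "y"), ("b", "y")]
candidate = [("a", "x"), ("x", "y"), ("c", "y")]
print(round(dataflow_match(candidate, reference), 3))  # 1.0 after renaming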
3 Existing methods
For convenience in understanding the technical progress, in this section we briefly review the solutions in chronological order, as presented in the CodeXGLUE benchmark.
3.1 Benchmark’s Baseline models
The authors of the benchmark article [2] provided 4 baseline models so that participants can compare their results against them. The Naive method directly copies the buggy code as the repair result. As for the Transformer [9], they used the same number of layers and hidden size as the pretrained models. They also used a vanilla LSTM [10] to solve this task. With regard to the CodeBERT [11] method, they initialized the encoder using CodeBERT and used a randomly initialized Transformer with 6 layers, 768-dimensional hidden states and 12 attention heads as the decoder, see Figure 2. CodeBERT is a bimodal pretrained model based on the Transformer with 12 layers, 768-dimensional hidden states, and 12 attention heads for programming language (PL) and natural language (NL). Feng et al. [11] pretrain CodeBERT with the masked language modeling and replaced token detection objectives on the CodeSearchNet dataset [12], which includes 2.4M functions with document pairs for six programming languages. The model supports different types of sequence input, such as text/code and code/code, with a special token [CLS] in front of the sequence and a special symbol [SEP] to separate the two kinds of data. They then used the training data to fine-tune the whole model.
Figure 2: Pipeline for the Encoder-Decoder framework.
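A rough sketch of this setup with the Hugging Face transformers library is given below: a pretrained CodeBERT encoder paired with a randomly initialized 6-layer decoder and trained on a buggy-fixed pair. It approximates the baseline configuration described above; the checkpoint name and all hyperparameters are assumptions, not the benchmark's actual training code.

from transformers import (AutoTokenizer, EncoderDecoderModel,
                          RobertaModel, RobertaConfig, RobertaForCausalLM)

checkpoint = "microsoft/codebert-base"  # assumed CodeBERT checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Pretrained 12-layer CodeBERT encoder.
encoder = RobertaModel.from_pretrained(checkpoint)

# Randomly initialized 6-layer decoder with 768-d hidden states and 12 heads.
decoder_config = RobertaConfig(
    vocab_size=tokenizer.vocab_size, hidden_size=768,
    num_hidden_layers=6, num_attention_heads=12,
    is_decoder=True, add_cross_attention=True)
decoder = RobertaForCausalLM(decoder_config)

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# One fine-tuning step on a single buggy -> fixed pair (illustrative only).
buggy = "public int add(int a, int b) { return a - b; }"
fixed = "public int add(int a, int b) { return a + b; }"
inputs = tokenizer(buggy, return_tensors="pt")
labels = tokenizer(fixed, return_tensors="pt").input_ids
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()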
3.2 PLBART model
PLBART [13] uses the same architecture as BART_base [14]: a sequence-to-sequence Transformer architecture [9] with 6 encoder layers and 6 decoder layers, a model dimension of 768, and 12 heads (∼140M parameters). The only exception is an additional layer-normalization layer on top of both the encoder and the decoder, following [15], which is found to stabilize training with FP16 precision.
PLBART is trained as a denoising autoencoder: the input text is corrupted by a noise function, and the model learns to reconstruct the original input, which requires it to learn language syntax and semantics. They used three noising strategies: token masking, token deletion, and token infilling [14]. In the first two strategies, random tokens are sampled and replaced with a mask token or deleted from the input sequence. In token infilling, a number of text spans are sampled and replaced with a single mask token, where the span lengths are drawn from a Poisson distribution (λ = 3.5). They masked 35% of the tokens in each instance.
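The three noising strategies can be sketched over a token list as follows. The mask token string and the sampling details are our simplifications; only the 35% corruption rate and the Poisson span length with λ = 3.5 follow the description above.

import numpy as np

MASK = "<mask>"
rng = np.random.default_rng(0)

def token_masking(tokens, rate=0.35):
    return [MASK if rng.random() < rate else t for t in tokens]

def token_deletion(tokens, rate=0.35):
    return [t for t in tokens if rng.random() >= rate]

def token_infilling(tokens, rate=0.35, lam=3.5):
    """Replace sampled spans with a single mask token; span lengths are drawn
    from a Poisson(lam) distribution, corrupting roughly `rate` of the tokens."""
    tokens = list(tokens)
    budget = int(rate * len(tokens))
    while budget > 0 and len(tokens) > 1:
        span = int(min(max(rng.poisson(lam), 1), budget, len(tokens) - 1))
        start = int(rng.integers(0, len(tokens) - span + 1))
        tokens[start:start + span] = [MASK]
        budget -= span
    return tokens

code = "public int add ( int a , int b ) { return a - b ; }".split()
print(token_masking(code))
print(token_deletion(code))
print(token_infilling(code))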
3.3 CoTexT model
CoTexT [16] follows the sequence-to-sequence encoder-decoder architecture proposed by [9]. They initialized it from the Base T5 model released by [17], which has 220 million parameters, and trained the model with a 0.001 learning rate and an input/target length of 1024.
The model is trained with a maximum likelihood objective (that is, using “teacher forcing” [18]) regardless of whether the task is text-to-code or code-to-text. Therefore, for CoTexT, they leveraged Multi-Task learning [17] to perform both text-code and code-text generation on the Code Summarization and Code Refinement tasks. To specify which task the model should perform, they simply added a task-specific prefix to the input sequence. For example, when fine-tuning on the Code Summarization task for each programming language, they simply prepended a prefix with the PL name (e.g., Java) to the input sequence.
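The prefix mechanism can be sketched as follows with a generic T5 checkpoint from the transformers library; the checkpoint name and the prefix strings are placeholders for illustration and are not the ones used by CoTexT.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-base" stands in for the CoTexT checkpoint; the prefixes are illustrative.
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

buggy = "public int add(int a, int b) { return a - b; }"
examples = {
    "code refinement java: ": buggy,      # code -> code
    "code summarization java: ": buggy,   # code -> text
}
for prefix, code in examples.items():
    inputs = tokenizer(prefix + code, return_tensors="pt", truncation=True)
    output = model.generate(**inputs, max_new_tokens=64)
    print(prefix, "->", tokenizer.decode(output[0], skip_special_tokens=True))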
3.4 NSEdit model
They used the Transformer model [9] to perform sequence-to-sequence prediction. The NSEdit model [19] computes f(x) = \hat{e}, where x is the buggy token sequence and \hat{e} is the predicted editing sequence. The encoder processes the buggy code x and outputs the encoder memory m, as shown in Equation 11. For an input x with L tokens and a model with h hidden units, the encoder memory has shape (L, h), omitting the batch dimension. The decoder takes m and the current editing sequence token e_i as input and autoregressively predicts the next token e_{i+1} by maximum likelihood, as shown in Equations 12 and 13, where [\cdot] denotes the slicing operator in Python.
m = encoder(x),    (11)

\hat{E}_{i+1} = decoder(m, e_i),    (12)

\hat{e}_{i+1} = \arg\max_{w \in W} \hat{E}_{i+1}[w].    (13)
They used teacher forcing as the training procedure [18, 20]. This means that in Equation 12, the ground truth edit token e_i is fed into the decoder rather than the predicted token \hat{e}_i. They fine-tuned the pre-trained CodeBERT [11] and CodeGPT [2] models on the Bugs2Fix dataset [3]. They modified the decoder to have two modes: a word/action mode that predicts edit actions and inserted words, and a location mode that predicts edit locations.
The original CodeBERT tokenizer has 50,265 word tokens in its vocabulary, and they add [DELETE] and [INSERT] tokens to the vocabulary. When predicting words or actions, the decoder outputs a probability vector \hat{W} over the set W of 50,267 elements by passing the logits output c through the softmax function, as shown in Equation 14.

\hat{W} = \frac{\exp(c)}{\sum_{j \in W} \exp(c_j)}.    (14)
When predicting locations, instead of further expanding the vocabulary with 513 location tokens and predicting them along with words and actions, the decoder uses a pointer network in place of its last layer (Figure 3). The pointer network is a feed-forward neural network. It transforms the output of the penultimate layer of the decoder into a latent representation v [21, 22]. To determine the location of the edit, they compute the dot product between v and m, followed by a softmax over all edit locations, as shown in Equation 15. As a result, the pointer network outputs a probability vector \hat{L} over the edit locations at indices 0, 1, 2, ..., L for a buggy code with L tokens.

\hat{L} = \frac{\exp(v^T m)}{\sum_{j=0}^{L} \exp(v^T m_j)}.    (15)
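Equation 15 amounts to dot-product attention over the encoder memory. The PyTorch sketch below illustrates the idea; the feed-forward projection, the hidden size, and the shapes are illustrative assumptions rather than details taken from the NSEdit implementation.

import torch
import torch.nn as nn

class LocationPointer(nn.Module):
    """Score every edit location by the dot product between a projected
    decoder state v and the encoder memory m (Equation 15)."""
    def __init__(self, hidden_size=768):
        super().__init__()
        # Feed-forward network producing the latent representation v.
        self.ffn = nn.Sequential(nn.Linear(hidden_size, hidden_size),
                                 nn.GELU(),
                                 nn.Linear(hidden_size, hidden_size))

    def forward(self, decoder_state, encoder_memory):
        # decoder_state: (batch, hidden); encoder_memory: (batch, L+1, hidden)
        v = self.ffn(decoder_state)                           # (batch, hidden)
        scores = torch.einsum("bh,blh->bl", v, encoder_memory)
        return torch.softmax(scores, dim=-1)                  # probabilities over locations

pointer = LocationPointer()
probs = pointer(torch.randn(2, 768), torch.randn(2, 51, 768))
print(probs.shape, probs.sum(dim=-1))  # torch.Size([2, 51]), each row sums to ~1.0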
Since the ground truth is available with teacher forcing, they determined which decoder mode to use based on the type of the ground truth token e_{i+1}.
Figure 3: An illustration of the main NSEdit model architecture. There are two modes in the decoder. One mode
predicts words and actions, and the other mode selects locations with a pointer network. The pointer network takes the
penultimate layer output of the decoder and compares it with the encoder memory by dot product in order to select edit
location.
They sliced the encoder memory and used m[l] to replace the embedding of a location token [LOC_l] as the input to the decoder in Equation 12. As a result, both the input m[l] and the output v of the decoder for locations are content-based representations, rather than fixed location embeddings that do not change when the location context changes with the input program. They used cross-entropy loss for both word/action prediction and location prediction and added the two losses together with equal coefficients.
They also used beam search [23, 24, 25] to generate the top-5 editing sequences (hypotheses). They trained two
rerankers with different architectures to classify which editing sequence is correct among the beam search hypotheses.
The two rerankers and the original beam search score are combined with an ensemble model to produce the final
reranking. Lastly, they fine-tuned the rerankers on the beam search hypotheses on the validation set to reduce over-fitting.
The beam search hypotheses reranked by the fine-tuned ensemble are the final predictions.
Figure 4 shows the edit grammar whose rules the NSEdit model learns to predict.
4 Comparison of models
In this section, we will compare models according to various criteria.
4.1 Programming languages
The models presented in the benchmark were trained on data containing various programming languages: Java, Python, Javascript, PHP, Ruby, and Go. The distribution by model can be found in Table 1. We mark as "None" those methods that have no trainable parameters.
As can be seen, every trainable model includes the Java programming language; this is explained by the fact that the evaluation dataset consists of only one programming language, Java.
Figure 4: The transition diagram of the finite state machine for the NSEdit grammar used to generate editing sequences.
The start state is the state BOS. The accept state is the state EOS.
Model         Programming languages
NSEdit        Java
CoTexT        Java, Python, Javascript, PHP, Ruby, Go
CodeBERT      Java
PLBART        Java, Python
Transformer   Java
LSTM          Java
Naive copy    None

Table 1: Programming languages.
4.2 Pretrained tasks
In this section we describe the tasks on which the encoder and decoder architectures presented in the benchmark were pretrained, see Table 2. In some cases it is not possible to separate the Encoder and Decoder as distinct architectures; for these we have merged the corresponding columns. Pretraining tasks can also only be given for Transformer-type architectures; for the other architectures we mark the pretraining tasks as "None".
Model         Encoder                                               Decoder
NSEdit        Masked Language Modeling, Replaced Token Detection    Next Token Prediction
CoTexT        Masked Language Modeling (encoder and decoder merged)
CodeBERT      Masked Language Modeling, Replaced Token Detection    w/o pretraining
PLBART        Token masking, Token deletion, Token infilling (encoder and decoder merged)
Transformer   w/o pretraining (encoder and decoder merged)
LSTM          None
Naive copy    None

Table 2: Pretrained tasks.
4.3 Parameters
In this section, we give a detailed description of the Transformer architectures present in the benchmark, see Table 3. The description contains: number of layers, maximum position length, embedding size, number of attention heads, attention head size, vocabulary size, and total number of parameters. As can be seen, almost all the models are of a comparable scale.
                             CodeBERT   PLBART   CoTexT   NSEdit   Transformer
Number of layers             12         6        12       24       12
Max length of position       512        1024     1024     512      512
Embedding size               768        768      768      768      768
Attention heads              12         12       12       12       12
Attention head size          64         64       64       64       64
Vocabulary size              50,265     50,004   -        50,267   32,000
Total number of parameters   125M       140M     220M     249M     220M

Table 3: Parameters.
4.4 Metrics
The benchmark provides the opportunity to submit results, following certain rules, to its evaluation system so that model results can be compared objectively. As mentioned earlier, this task is evaluated with three quality metrics: BLEU, Exact Match Accuracy, and CodeBLEU (see Sec. 2.2.1, Sec. 2.2.2, Sec. 2.2.3). Table 4 shows the results of the benchmark evaluation.
                                                              small test set              medium test set
Model         Organization                     Date           BLEU    Acc(%)  CodeBLEU    BLEU    Acc(%)  CodeBLEU
NSEdit        NSEdit Team                      2021-11-18     71.06   24.04   /           85.72   13.87   /
CoTexT        Case Western Reserve University  2021-04-23     77.91   22.64   76.43       88.31   15.36   84.53
CodeBERT      CodeXGLUE Team                   2020-08-30     77.42   16.40   75.58       91.07   5.16    87.52
PLBART        UCLA & Columbia University       2021-04-02     77.02   19.21   /           88.50   8.98    /
Transformer   CodeXGLUE Team                   2020-08-30     77.21   14.70   73.31       89.25   3.70    81.72
LSTM          CodeXGLUE Team                   2020-08-30     76.76   10.00   /           72.08   2.50    /
Naive copy    CodeXGLUE Team                   2020-08-30     78.06   0       /           90.91   0       /

Table 4: Results on the code repair task.
4.5 Complexity
The benchmark contains only recurrent and Transformer models, whose main building blocks are the Recurrent Cell and Self-Attention, respectively. We think it is important to compare the complexity only of these building blocks, not of each model individually. Table 5 compares the building blocks in terms of Complexity per Layer, Sequential Operations, and Maximum Path Length.
5 Discussion
We were very inspired by the most recent development, NSEdit [19]. The well-formulated hypothesis that Transformers tend to memorize specific data (a particular programming language), together with the unique approach to testing it, namely developing their own regular edit language and using it as the training target for the Transformer, led to the best results on the benchmark and prompted a number of further studies on this topic.
Layer Type              Recurrent      Self-Attention
Complexity per Layer    O(T · D^2)     O(T^2 · D)
Sequential Operations   O(T)           O(1)
Maximum Path Length     O(T)           O(1)

Table 5: Per-layer complexity, minimum number of sequential operations and maximum path lengths for different layer types. T is the sequence length, D is the representation dimension.
After reviewing all this research, using a Transformer to predict a regular edit language seems to be the most promising direction for the development of the code refinement task.
At the moment, the results obtained are still far from ideal. As is well known, the quality of a Transformer model depends heavily on its scale. Leading companies such as OpenAI^5, Google^6, Microsoft^7 and Meta^8 often release pretrained models in several configurations: small, base, large, xlarge, etc., the largest of which have several billion parameters and perform better in evaluations. As an improvement to the current result, we propose to increase the number of trained parameters and layers in the neural network. This may run into the problem of obtaining a sufficient amount of training data, so we also propose to develop a converter that translates code from a specific programming language into pseudo code, and to train the Transformer model on pairs of (pseudo code with bugs, edits that fix the bug). This approach would increase the training sample several times over, because any programming language could then be used for training. As an example, we give a possible inference pipeline for such an architecture, see Figure 5.
Figure 5: Possible inference pipeline to improve results.
6 Conclusion
In this paper, we described the task, gave a detailed description of the dataset that is used to evaluate models, and examined which metrics are used for evaluation and how they are calculated. We also gave brief descriptions of the existing methods and compared them from different points of view. At the end of the work, we speculated on how the current results could be improved.
References
[1] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.
[2] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn
Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding
and generation. arXiv preprint arXiv:2102.04664, 2021.
[3] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk.
An empirical study on learning bug-fixing patches in the wild via neural machine translation. ACM Transactions
on Software Engineering and Methodology (TOSEM), 28(4):1–29, 2019.
^5 https://openai.com/
^6 https://www.google.com/
^7 https://www.microsoft.com/
^8 https://www.meta.com/
[4] Michael Fischer, Martin Pinzger, and Harald Gall. Populating a release history database from version control and
bug tracking systems. In International Conference on Software Maintenance, 2003. ICSM 2003. Proceedings.,
pages 23–32. IEEE, 2003.
[5] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. Fine-grained and
accurate source code differencing. In Proceedings of the 29th ACM/IEEE international conference on Automated
software engineering, pages 313–324, 2014.
[6] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of
machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics,
pages 311–318, 2002.
[7] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint
arXiv:2009.10297, 2020.
[8] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey
Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representations with data flow. arXiv preprint
arXiv:2009.08366, 2020.
[9] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and
Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[11] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting
Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint
arXiv:2002.08155, 2020.
[12] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet
challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
[13] Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Unified pre-training for program
understanding and generation. arXiv preprint arXiv:2103.06333, 2021.
[14] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves
Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[15] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke
Zettlemoyer. Multilingual denoising pre-training for neural machine translation. Transactions of the Association
for Computational Linguistics, 8:726–742, 2020.
[16] Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. Cotext: Multi-task
learning with code-text transformer. arXiv preprint arXiv:2105.08645, 2021.
[17] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint
arXiv:1910.10683, 2019.
[18] Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks.
Neural computation, 1(2):270–280, 1989.
[19] Yaojie Hu, Xingjian Shi, Qiang Zhou, and Lee Pike. Fix bugs with transformer through a neural-symbolic edit
grammar. arXiv preprint arXiv:2204.06643, 2022.
[20] Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville,
and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. Advances in neural
information processing systems, 29, 2016.
[21] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. Advances in neural information processing
systems, 28, 2015.
[22] Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. Neural program repair by jointly
learning to localize and repair. arXiv preprint arXiv:1904.01720, 2019.
[23] D Raj Reddy et al. Speech understanding systems: A summary of results of the five-year research effort. Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 17:138, 1977.
[24] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
[25] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. Advances in
neural information processing systems, 27, 2014.