CodeMMLU Challenge Technical Report
Ngo Van Tuan Anh
FPT University, Hanoi, Vietnam
anhnvthe186975@fpt.edu.vn
Abstract
This report details our approach to the CodeMMLU Challenge, which evaluates the ability of Large Language Models (LLMs) to understand and reason about code [1]. We leverage
a transformer-based sequence classification model, fine-tuned on the CodeMMLU dataset.
Our methodology emphasizes careful data preprocessing, targeted fine-tuning, and rigorous
evaluation. We present an analysis of the dataset characteristics and describe our data cleaning and augmentation strategies. We also detail our model selection process, including our
approach to mitigate overfitting by freezing layers and applying L2 regularization. Finally,
we propose future improvements, including the use of a Siamese network architecture and
advanced data augmentation techniques.
1 Introduction
Recent advancements in Code Large Language Models (CodeLLMs) have demonstrated capabilities across various software engineering (SE) tasks [2, 3]. However, benchmarks are evolving to
better assess true code understanding [4]. Nguyen et al. [1] introduced CodeMMLU to address
limitations in existing benchmarks that often focus on code generation [5, 6]. CodeMMLU is a
multi-choice question answering (MCQA) benchmark designed to evaluate the depth of software
and code understanding in LLMs. It includes questions sourced from diverse domains, encompassing tasks such as code analysis, defect detection, and software engineering principles across
multiple programming languages. This comprehensive approach enables a deeper assessment of
how CodeLLMs grasp coding concepts, moving beyond mere generation capabilities.
2 Methodology
Our methodology is organized into several key components, beginning with an in-depth dataset
analysis, followed by data preprocessing, model training, evaluation, and finally outlining potential improvements.
2.1 Dataset Analysis
An initial exploratory data analysis (EDA) revealed several critical issues within the training
data, as also noted by Nguyen et al. [1]. Firstly, the answer option ’E’ is extremely underrepresented, leading to an imbalanced distribution of answer choices. Additionally, a number
of questions lack a correct answer entirely. A significant proportion of the questions are boolean (Yes/No or True/False), and the answer format is inconsistent: some labels are prefixed with “ANSWER: ” while others are given as a single character (e.g., “A”). These observations guided the subsequent data preprocessing and augmentation strategies.
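For reference, the kind of checks behind these observations can be sketched as follows. This is only an illustration: the file name and the column names (question, choices, answer) are assumptions rather than the exact CodeMMLU field names.

```python
import pandas as pd

# Load the training split; the file name here is illustrative.
df = pd.read_json("codemmlu_train.jsonl", lines=True)

# (a) Number of choices per question.
print(df["choices"].apply(len).value_counts().sort_index())

# (b) Frequency of each answer option, normalizing "ANSWER: A" and "A"
#     to a single letter before counting.
letters = (
    df["answer"].astype(str)
    .str.replace("ANSWER:", "", regex=False)
    .str.strip()
    .str[:1]
)
print(letters.value_counts())  # 'E' turns out to be extremely rare

# (c) Questions whose labelled answer does not map to any provided choice.
n_choices = df["choices"].apply(len)
valid = [letter in "ABCDEFG"[:n] for letter, n in zip(letters, n_choices)]
print(f"{(~pd.Series(valid)).sum()} questions lack a usable correct answer")
```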
Figure 1: (a) Number of choices per question and (b) Frequency of answer option.
2.2 Dataset Preprocessing
We utilize the CodeMMLU dataset [1] and address the identified data issues through several
preprocessing steps. Questions that contain the answer ’G’ or lack a correct answer are discarded
to ensure data integrity. To standardize the input, all answer choices are padded to a fixed
length of five by inserting the token “None” where necessary. For multiple-choice questions,
the dataset is augmented by shuffling the answer choices, with each question being shuffled 10
times to generate additional training examples. In contrast, boolean questions are augmented
by shuffling only between positions A and B, thus creating two new examples per instance.
Importantly, the augmentation is performed after splitting the data into training and validation
sets to prevent data leakage. Data handling is facilitated by a custom BaseCodeDataset class
and its subclasses (TrainDataset, ValDataset, and TestDataset), which manage data loading,
preprocessing, and batching. Key preprocessing tasks include tokenizing questions and answer
choices using the Qwen/Qwen2.5-Coder-0.5B-Instruct tokenizer [7], clipping input sequences to
a maximum length of 512 tokens to conserve computational resources, and appending special
tokens (e.g., <question>, </question>, <A>, </A>) to clearly delineate the different segments
of the input.
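A condensed sketch of the padding and shuffling augmentation described above follows. It illustrates the logic implemented in our TrainDataset rather than reproducing it verbatim, and the field names are assumptions.

```python
import random

LETTERS = ["A", "B", "C", "D", "E"]

def pad_choices(choices, target=5, pad_token="None"):
    """Pad the choice list to a fixed length of five with the token 'None'."""
    return choices + [pad_token] * (target - len(choices))

def shuffle_example(question, choices, answer_idx, n_shuffles=10, is_boolean=False):
    """Create augmented copies of one training example by permuting its choices.

    Boolean (Yes/No, True/False) questions only swap positions A and B,
    yielding two examples; other questions are shuffled n_shuffles times.
    """
    examples = []
    if is_boolean:
        orders = [[0, 1], [1, 0]]
    else:
        orders = []
        for _ in range(n_shuffles):
            order = list(range(len(choices)))
            random.shuffle(order)
            orders.append(order)
    for order in orders:
        new_choices = [choices[i] for i in order]
        new_answer = order.index(answer_idx)  # where the correct choice ended up
        examples.append({
            "question": question,
            "choices": pad_choices(new_choices),
            "answer": LETTERS[new_answer],
        })
    return examples
```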
Figure 2: (a) Distribution of Question Token Lengths (clipped), (b) Distribution of Combined clipped Choices Token Lengths, and (c) Distribution of Final Unpadded Sequence Lengths.
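The input construction itself can be sketched as follows, again as an approximation of our pipeline: the delimiter tokens mirror those listed above, while the exact token registration and truncation details may differ in practice.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-0.5B-Instruct")

# Register the delimiters as special tokens so they are not split into sub-words.
delims = ["<question>", "</question>"] + [t for c in "ABCDE" for t in (f"<{c}>", f"</{c}>")]
tokenizer.add_special_tokens({"additional_special_tokens": delims})

def build_input(question, choices, max_length=512):
    """Wrap the question and each (padded) choice in its delimiters and tokenize."""
    text = f"<question>{question}</question>"
    for letter, choice in zip("ABCDE", choices):
        text += f"<{letter}>{choice}</{letter}>"
    # Clip to 512 tokens to keep memory and compute bounded.
    return tokenizer(text, truncation=True, max_length=max_length)
```

If special tokens are added this way, the classifier's embedding matrix has to be resized accordingly via model.resize_token_embeddings(len(tokenizer)).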
2.3 Model Architecture and Training
The core model employed is the Qwen/Qwen2.5-Coder-0.5B-Instruct, a transformer-based sequence classification model [7]. Although alternative models such as CodeBERT [8] and CodeT5
[3] were evaluated, the Qwen2.5-Coder model demonstrated superior performance for this task.
Fine-tuning is carried out using the AdamW optimizer [9], with most of the model’s layers frozen
to mitigate overfitting; only the final four layers are trained. Additionally, L2 regularization
is applied during training. Due to the rarity of the ’E’ answer option, the output classes are
reduced from five to four (A, B, C, D).
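A minimal sketch of this fine-tuning setup is shown below, assuming the standard Hugging Face sequence-classification wrapper. The module names follow the public Qwen2 layout, and the hyperparameter values are illustrative rather than the exact ones we used.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-Coder-0.5B-Instruct",
    num_labels=4,  # 'E' is dropped, leaving the classes A, B, C, D
)
# If delimiter tokens were added to the tokenizer (Section 2.2), resize the
# embedding matrix: model.resize_token_embeddings(len(tokenizer)).

# Freeze everything, then unfreeze only the last four transformer layers
# and the classification head.
for param in model.parameters():
    param.requires_grad = False
for layer in model.model.layers[-4:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.score.parameters():
    param.requires_grad = True

# AdamW with weight decay provides the L2 regularization mentioned above.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=2e-5,
    weight_decay=0.01,
)
```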
3 Evaluation and Results
Model performance is primarily assessed using accuracy metrics. The fine-tuned model generates predictions for the test dataset, and these results are exported to a CSV file for subsequent
analysis. Table 1 summarizes the evaluation results of various models. For instance, while
CodeBERT-base and CodeT5-base demonstrate moderate training and evaluation accuracy,
the Qwen-based models achieve significantly higher accuracies during training. However, the
gap between training/evaluation accuracy and submission accuracy suggests challenges in generalization.
Model                               Train Accuracy   Evaluation Accuracy   Submission Accuracy
CodeBERT-base                       0.30             0.28                  –
CodeT5-base                         0.55             0.27                  –
Qwen2-0.5B-Instruct                 0.98             0.97                  0.46
Qwen2-0.5B-Instruct (U4)            0.87             0.54                  0.48
Qwen2.5-Coder-0.5B-Instruct (U4)    0.87             0.54                  0.51

Table 1: Evaluation results of different models on the CodeMMLU Challenge. U4 denotes that the last four layers are unfrozen.
Preliminary experiments indicate promising performance, with the Qwen-based models
achieving high training and evaluation accuracies. Nonetheless, the relatively lower submission
accuracies suggest that further improvements in model generalization and data augmentation
techniques are needed.
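For completeness, the inference and export step described at the start of this section can be sketched as follows; the CSV column names are illustrative and may differ from the exact submission format.

```python
import csv
import torch

@torch.no_grad()
def predict_and_export(model, test_loader, out_path="submission.csv"):
    """Run the fine-tuned classifier over the test set and write predictions to CSV."""
    model.eval()
    letters = "ABCD"  # the reduced label set
    rows = []
    for batch in test_loader:
        logits = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
        ).logits
        for task_id, pred in zip(batch["task_id"], logits.argmax(dim=-1)):
            rows.append({"task_id": task_id, "answer": letters[int(pred)]})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["task_id", "answer"])
        writer.writeheader()
        writer.writerows(rows)
```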
4 Improvement Ideas
While the current approach has yielded encouraging results, several avenues for further improvement have been identified. One promising direction is to leverage advanced data augmentation
techniques using powerful LLMs, such as GPT-4 [10], to generate more diverse synthetic data.
Additionally, exploring a Siamese network architecture for answer evaluation may further enhance performance by enabling the model to assess the semantic similarity between generated
answers and the provided options. Other potential improvements include investigating augmentation methods like code mutation and context perturbation, as well as employing transfer
learning by pre-training on larger code-related datasets to enhance semantic understanding prior
to fine-tuning on CodeMMLU.
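To make the Siamese idea concrete, one possible (as yet unimplemented) formulation is sketched below: a shared encoder embeds the question and each candidate answer, and the option with the highest cosine similarity to the question embedding is selected. This is a rough sketch of the direction, not our current method.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class SiameseAnswerScorer(torch.nn.Module):
    """Score each answer option by embedding similarity to the question."""

    def __init__(self, name="Qwen/Qwen2.5-Coder-0.5B-Instruct"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.encoder = AutoModel.from_pretrained(name)  # shared weights for both branches

    def embed(self, texts):
        batch = self.tokenizer(
            texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
        )
        hidden = self.encoder(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1)
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean-pool real tokens

    def forward(self, question, choices):
        q = self.embed([question])                # (1, hidden)
        c = self.embed(choices)                   # (n_choices, hidden)
        return F.cosine_similarity(q, c, dim=-1)  # one score per option
```

The highest-scoring option would be taken as the prediction; in practice the shared encoder would be fine-tuned with a contrastive or cross-entropy objective over the option scores.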
References
[1] Nguyen, D.M., et al. CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding Capabilities. arXiv preprint arXiv:2310.05279, 2024.
[2] Allal, L.B., et al. Code Generation with Large Language Models. arXiv preprint arXiv:2302.06695, 2023.
[3] Wang, Y., et al. CodeT5: Identifier-aware Unified Pre-training for Code and Text. In
Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 9439–
9449, 2021.
[4] Matton, N., et al. Can Large Language Models Truly Reason? arXiv preprint arXiv:2404.07336, 2024.
[5] Austin, J., et al. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732, 2021.
[6] Chen, M., et al. Evaluating Large Language Models on Code Generation. arXiv preprint
arXiv:2107.03374, 2021.
[7] Hu, B., et al. Qwen Technical Report. arXiv preprint arXiv:2309.16662, 2023.
[8] Feng, Z., et al. CodeBERT: A Pre-trained Model for Programming and Natural Languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 1409–1419, 2020.
[9] Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. In International
Conference on Learning Representations (ICLR), 2019.
[10] OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774, 2023.
[11] Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pp.
4171–4186, 2019.
[12] Vaswani, A., et al. Attention Is All You Need. In Advances in Neural Information Processing
Systems (NeurIPS), pp. 5998–6008, 2017.