
[2403.17214] Output Format Biases in the Evaluation of Large Language Models for Code Translation

Authors: Marcos Macedo and 3 other authors


Abstract: Translating code between programming languages (PLs) is a critical task in software engineering that facilitates the modernization of legacy systems, ensures cross-platform compatibility, and enhances software performance. Most existing studies prompt large language models (LLMs) to perform code translation and evaluate their performance by either executing the generated output against test suites or comparing it with the reference (ground truth) output. However, this output may contain not only executable source code but also additional non-code elements, such as natural language explanations or formatting tokens. We refer to the combination of source code and non-code elements as the output format. It is essential to understand and address variations in the output format, because non-code elements can interfere with evaluation metrics, leading to biased assessments of model performance and comparisons. We conduct an empirical analysis of the outputs of eleven instruction-tuned open-source LLMs across five PLs: C, C++, Go, Java, and Python. The results show that between 26.4% and 73.7% of the output produced by the evaluated LLMs requires post-processing. To mitigate output format bias, we propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format output, enabling the eleven open-source models to achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our empirical study confirms that output format bias affects widely used execution-based metrics, i.e., Computational Accuracy (CA), and text-based metrics, i.e., BLEU, CodeBLEU, and CrystalBLEU. In addition, we tested five closed-source LLMs and observed that they also generate varying distributions of output formats, which may lead to output format biases. Our results highlight the need to mitigate output format bias to enable reliable LLM evaluations of code translation.
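The abstract does not give the paper's actual prompts or regular expressions, so the following is only a minimal sketch of the regex side of such an extraction step, assuming the model wraps translated code in Markdown fences; the extract_code helper and the fence pattern here are illustrative, not the authors' implementation.

import re

# Illustrative pattern: an optional language tag after the opening fence,
# then the fenced body, captured non-greedily up to the closing fence.
FENCE_PATTERN = re.compile(r"```[a-zA-Z0-9+#]*\s*\n(.*?)```", re.DOTALL)

def extract_code(llm_output: str) -> str:
    """Return the first fenced code block, or the raw output if no fence is found."""
    match = FENCE_PATTERN.search(llm_output)
    if match:
        return match.group(1).strip()
    # Fallback: treat the whole response as code (no non-code elements present).
    return llm_output.strip()

if __name__ == "__main__":
    sample = ("Here is the translated Python code:\n"
              "```python\nprint('hello')\n```\n"
              "Hope this helps!")
    print(extract_code(sample))  # prints: print('hello')

A post-processing step like this would strip the natural language explanations and formatting tokens before execution-based or text-based metrics are computed, which is the kind of output format bias the paper measures.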

Submission history

From: Marcos Macedo

[v1] Monday, 25 March 2024, 21:41:31 UTC (716 KB)
[v2] Monday, 13 October 2025, 01:19:11 UTC (768 KB)
