[2502.06193] Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

View a PDF of the paper titled Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering, by Ruiqi Wang and 5 other authors
View PDF HTML (experimental)
Abstract: Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric requires extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics such as BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged of employing LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics, without relying on high-quality reference answers. However, their precise alignment with human judgments on SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that use general-purpose LLMs, along with two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets covering code translation, code generation, and code summarization, we prompt these methods to evaluate each response. Finally, we compare the scores produced by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlations of 81.32 and 68.51 with human scores on code translation and code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we offer …
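To make the abstract's notion of "alignment with human judgments" concrete, the following is a minimal, illustrative Python sketch (not taken from the paper) of how a Pearson correlation between an LLM-as-a-judge method's scores and human scores can be computed; the score lists are made-up placeholders, not the paper's data.

```python
# Illustrative sketch only: quantifying judge/human alignment with a
# Pearson correlation, the measure reported in the abstract.
# The scores below are hypothetical, not the paper's data.
from statistics import mean
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical 1-5 quality ratings for the same set of LLM responses.
human_scores = [5, 3, 4, 2, 5, 1, 4, 3]
judge_scores = [4, 3, 5, 2, 5, 2, 4, 3]  # e.g. from an output-based judge

print(f"Pearson correlation: {pearson(human_scores, judge_scores):.2f}")
```

A higher correlation means the judge's scores rise and fall with the human ratings, which is the sense in which the abstract describes output-based methods as achieving near-human evaluation.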
Submission history
From: Ruiqi Wang [view email]
[v1]
Mon, 10 Feb 2025 06:49:29 UTC (504 KB)
[v2]
Thu, 10 Apr 2025 07:33:55 UTC (530 KB)