[2502.06193] Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering

View a PDF of the paper titled Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering, by Ruiqi Wang and 5 other authors
View PDF HTML (experimental)
Abstract: Recently, large language models (LLMs) have been deployed to tackle various software engineering (SE) tasks such as code generation, significantly advancing the automation of SE tasks. However, assessing the quality of this LLM-generated code and text remains challenging. The commonly used Pass@k metric requires extensive unit tests and configured environments, demands a high labor cost, and is not suitable for evaluating LLM-generated text. Conventional metrics such as BLEU, which measure only lexical rather than semantic similarity, have also come under scrutiny. In response, a new trend has emerged of employing LLMs for automated evaluation, known as LLM-as-a-judge. These LLM-as-a-judge methods are claimed to better mimic human assessment than conventional metrics, without relying on high-quality reference answers. However, their precise alignment with human judgments on SE tasks remains unexplored. In this paper, we empirically explore LLM-as-a-judge methods for evaluating SE tasks, focusing on their alignment with human judgments. We select seven LLM-as-a-judge methods that use general-purpose LLMs, along with two LLMs specifically fine-tuned for evaluation. After generating and manually scoring LLM responses on three recent SE datasets covering code translation, code generation, and code summarization, we prompt these methods to evaluate each response. Finally, we compare the scores produced by these methods with human evaluation. The results indicate that output-based methods reach the highest Pearson correlations of 81.32 and 68.51 with human scores on code translation and code generation, achieving near-human evaluation and noticeably outperforming ChrF++, one of the best conventional metrics, at 34.23 and 64.92. Such output-based methods prompt LLMs to output judgments directly, and exhibit more balanced score distributions that resemble human score patterns. Finally, we offer …
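To make the abstract's notion of "alignment with human judgments" concrete, the following is a minimal, illustrative Python sketch (not taken from the paper) of how a Pearson correlation between an LLM-as-a-judge method's scores and human scores can be computed; the score lists are made-up placeholders, not the paper's data.

```python
# Illustrative sketch only: quantifying judge/human alignment with a
# Pearson correlation, the measure reported in the abstract.
# The scores below are hypothetical, not the paper's data.
from statistics import mean
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical 1-5 quality ratings for the same set of LLM responses.
human_scores = [5, 3, 4, 2, 5, 1, 4, 3]
judge_scores = [4, 3, 5, 2, 5, 2, 4, 3]  # e.g. from an output-based judge

print(f"Pearson correlation: {pearson(human_scores, judge_scores):.2f}")
```

A higher correlation means the judge's scores rise and fall with the human ratings, which is the sense in which the abstract describes output-based methods as achieving near-human evaluation.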
Submission history
From: Ruiqi Wang [view email]
[v1]
Mon, 10 Feb 2025 06:49:29 UTC (504 KB)
[v2]
Thu, 10 Apr 2025 07:33:55 UTC (530 KB)