Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks
arXiv:2604.12379v1
Large language models (LLMs) are increasingly used to solve coding tasks, but whether a model reaches its answer through sound reasoning is a separate question from whether the output is correct. Evaluating that reasoning is difficult: existing reasoning evaluators are not tailored to coding, and existing benchmarks focus primarily on code generation, leaving other coding tasks largely unassessed.
To address these gaps, the researchers introduce CodeRQ-Bench, which they describe as the first benchmark for evaluating LLM reasoning quality across three categories of coding task: generation, summarization, and classification. The benchmark is intended both to clarify how well LLMs reason on these tasks and to serve as a tool for future research.
Key Findings from CodeRQ-Bench
Using CodeRQ-Bench, the researchers analyzed 1,069 mismatch cases from existing evaluators and identified five recurring limitations in current reasoning evaluation methods:
- Lack of Specificity: Many existing evaluators do not account for the unique aspects of coding tasks.
- Insufficient Coverage: Current benchmarks often overlook critical areas such as summarization and classification.
- Overemphasis on Output: A predominant focus on correct output can obscure the reasoning process.
- Inflexibility: Many evaluators are rigid and fail to adapt to various coding contexts.
- Limited Feedback Mechanisms: Current systems provide inadequate feedback to guide improvements in reasoning.
From these limitations, the researchers derived four design insights for reasoning evaluation in coding tasks:
- Contextual Adaptability: Evaluators should adapt to different coding scenarios and tasks.
- Multi-Dimensional Assessment: Evaluators should assess the reasoning process as well as the final output.
- Iterative Feedback: Evaluators should provide ongoing feedback that helps improve LLM reasoning.
- Evidence-Based Verification: Evaluators should verify reasoning against supporting evidence.
Introducing VERA
Guided by these insights, the researchers propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction to assess LLM reasoning quality more robustly.
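The summary does not describe VERA's internals, so the following is only an illustrative sketch of how a two-stage evaluator of this general shape could be wired together. It is not the authors' implementation. The function name two_stage_score, the support_fn callback, the ambiguity band, and the down-weighting rule are all hypothetical choices made for this example.

```python
from typing import Callable, List


def two_stage_score(
    steps: List[str],
    support_fn: Callable[[str], float],
    band: float = 0.2,
) -> float:
    """Hypothetical two-stage reasoning score in [0, 1].

    Stage 1 asks support_fn how well the evidence backs each step.
    Stage 2 halves the weight of steps whose support is near 0.5,
    treating them as ambiguous. Both rules are assumptions.
    """
    if not steps:
        raise ValueError("no reasoning steps to evaluate")

    # Stage 1: evidence-grounded verification, one support value per step.
    supports = [support_fn(step) for step in steps]

    # Stage 2: ambiguity-aware correction via a weighted mean.
    weights = [0.5 if abs(s - 0.5) < band else 1.0 for s in supports]
    return sum(w * s for w, s in zip(weights, supports)) / sum(weights)


# Toy evidence table (made up): 1.0 = supported, 0.5 = unclear.
toy_evidence = {
    "loop runs n times": 1.0,
    "list is sorted on entry": 0.5,
    "returns index of target": 1.0,
}
score = two_stage_score(
    list(toy_evidence),
    lambda step: toy_evidence.get(step, 0.0),
)
print(round(score, 3))
```

In this toy setup the unclear step counts for half as much as the clearly supported ones, so the corrected score sits above the plain mean of the three support values. A real evaluator would replace the lookup table with an actual check of each step against the code or task description.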
On CodeRQ-Bench, VERA consistently outperforms established baselines across four datasets, improving the area under the receiver operating characteristic curve (AUCROC) by up to 0.26 and the area under the precision-recall curve (AUPRC) by up to 0.21.
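For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed for a reasoning evaluator, assuming each reasoning trace has a binary reference label (1 = sound reasoning) and the evaluator emits a real-valued score. The labels and scores below are made-up toy values, not results from the paper, and average precision is used as a common step-wise estimate of AUPRC. The sketch assumes scores are not tied.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision values at the rank of each positive example."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_pos


# Toy data: 1 = reasoning judged sound by the reference label.
labels = [1, 0, 1, 1, 0, 0]
baseline_scores = [0.6, 0.7, 0.4, 0.8, 0.3, 0.5]
improved_scores = [0.9, 0.2, 0.7, 0.8, 0.4, 0.3]

for name, scores in [("baseline", baseline_scores), ("improved", improved_scores)]:
    print(name, round(auroc(labels, scores), 3), round(average_precision(labels, scores), 3))
```

Both metrics are threshold-free: they measure how well the evaluator's scores rank sound reasoning above flawed reasoning, which is why a gain is reported as an absolute difference on a 0 to 1 scale.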
To support further research on the reasoning capabilities of large language models, CodeRQ-Bench is publicly available at https://github.com/MrLYG/CodeRQ-Bench.
