Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision
Summary: arXiv:2604.06723v1
In today’s AI-assisted software engineering landscape, developers increasingly depend on Large Language Models (LLMs) that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can significantly reduce developer productivity. To mitigate this issue, providing calibrated confidence scores that accurately reflect the likelihood of correctness at the instance level has become essential. Such information empowers users to make immediate decisions regarding output acceptance, discern error-prone outputs, and better align their expectations with the model’s capabilities.
Despite the advanced nature of post-trained LLMs, they do not inherently yield well-calibrated confidence scores. This limitation has led researchers to develop post-hoc calibration methods. One widely adopted technique is global Platt-scaling of sequence-level confidence scores, which has proven effective in numerous generative software engineering tasks. However, this method remains largely unreliable or unexplored for Automated Code Revision (ACR) tasks, including program repair, vulnerability repair, and code refinement.
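To make the baseline concrete, here is a minimal sketch of global Platt-scaling in plain Python. It fits a single pair of parameters (a, b) on a held-out set so that sigmoid(a * logit(score) + b) approximates the probability of correctness. The choice of raw score (the exponential of the mean token log-probability), the function names, and the gradient-descent fit are all assumptions made for illustration; the source does not specify how the sequence-level score is computed or how the scaler is fitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p, eps=1e-6):
    """Map a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sequence_confidence(token_logprobs):
    """Assumed raw score: exp of the mean token log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def fit_global_platt(scores, labels, lr=0.1, steps=5000):
    """Fit one (a, b) for the whole dataset by minimizing log loss.

    Plain gradient descent is used to keep the sketch dependency-free;
    it is adequate for small examples, not a tuned optimizer.
    """
    xs = [logit(s) for s in scores]
    a, b = 1.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, labels):
            err = sigmoid(a * x + b) - y  # d(log loss)/d(logit)
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def apply_platt(score, a, b):
    """Calibrated probability of correctness for one raw score."""
    return sigmoid(a * logit(score) + b)


if __name__ == "__main__":
    # Hypothetical held-out data: raw confidences and 0/1 correctness.
    scores = [0.95, 0.9, 0.85, 0.8, 0.6, 0.55, 0.5, 0.4]
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    a, b = fit_global_platt(scores, labels)
    for s in (0.5, 0.9):
        print(f"raw={s:.2f} calibrated={apply_platt(s, a, b):.3f}")
```

Because the same (a, b) is applied to every output, this correction cannot adapt to samples whose miscalibration differs from the dataset-wide trend.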
The Need for Fine-grained Calibration
We hypothesize that conventional calibration methods are too coarse-grained for ACR tasks. In ACR, correctness is often determined by local edit decisions, so miscalibration can vary from sample to sample. This motivates a shift towards fine-grained confidence calibration.
Proposed Methodology
To address the calibration challenges in ACR, our study introduces local Platt-scaling applied separately to three distinct fine-grained confidence scores. The goal is to make confidence estimates for ACR outputs more reliable.
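This summary does not define local Platt-scaling precisely, so the following is only one plausible reading, sketched under stated assumptions: instead of one (a, b) for the whole dataset, a separate Platt scaler is fitted for each query, with held-out samples weighted by how close their raw score is to the query's score. The Gaussian kernel, the `bandwidth` parameter, and all function names are hypothetical and may differ from the method in the paper.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _logit(p, eps=1e-6):
    """Map a probability to log-odds, clipping to avoid infinities."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_weighted_platt(xs, labels, weights, lr=0.1, steps=3000):
    """Weighted logistic fit of (a, b) on log-odds inputs.

    Plain gradient descent keeps the sketch dependency-free; a real
    implementation would likely use a proper optimizer.
    """
    a, b = 1.0, 0.0
    total = sum(weights)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y, w in zip(xs, labels, weights):
            err = w * (_sigmoid(a * x + b) - y)
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / total
        b -= lr * grad_b / total
    return a, b


def local_platt(query_score, scores, labels, bandwidth=1.0):
    """Calibrate one score with a scaler fitted on its neighbourhood.

    Assumption: neighbourhoods are defined by a Gaussian kernel over
    the distance between log-odds of raw confidence scores.
    """
    xs = [_logit(s) for s in scores]
    xq = _logit(query_score)
    weights = [math.exp(-0.5 * ((x - xq) / bandwidth) ** 2) for x in xs]
    a, b = fit_weighted_platt(xs, labels, weights)
    return _sigmoid(a * xq + b)


if __name__ == "__main__":
    # Hypothetical held-out data: low scores are roughly calibrated,
    # high scores are overconfident (0.9 claimed, half correct).
    scores = [0.3] * 10 + [0.9] * 10
    labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    for q in (0.3, 0.9):
        print(f"raw={q:.2f} local={local_platt(q, scores, labels, 0.3):.3f}")
```

In this reading, the same procedure would be run once per fine-grained confidence score, each with its own held-out fit.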
Experimental Findings
We conducted experiments across three separate ACR tasks and employed different correctness metrics while examining 14 models of various sizes. The results indicate that fine-grained confidence scores consistently produce lower calibration errors across a broader range of probability intervals. Moreover, this effect is further amplified when global Platt-scaling is also applied.
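The summary does not name the calibration metric used. A common choice is expected calibration error (ECE) with equal-width probability bins, and the sketch below assumes that choice. It also returns the per-bin gap between mean confidence and accuracy, which is one way to inspect calibration across probability intervals. The function name and the default of 10 bins are illustrative assumptions.

```python
def calibration_report(confidences, labels, n_bins=10):
    """Return (ece, rows) using equal-width bins over [0, 1].

    Each row is (bin_low, bin_high, count, mean_confidence, accuracy).
    ECE is the count-weighted mean of |mean_confidence - accuracy|.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    n = len(confidences)
    ece = 0.0
    rows = []
    for i, bucket in enumerate(bins):
        if not bucket:
            continue  # empty intervals contribute nothing
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / n * abs(mean_conf - accuracy)
        rows.append((i / n_bins, (i + 1) / n_bins, len(bucket),
                     mean_conf, accuracy))
    return ece, rows


if __name__ == "__main__":
    # Hypothetical model outputs: claimed confidence vs. 0/1 correctness.
    confidences = [0.9, 0.9, 0.9, 0.9, 0.3, 0.3, 0.3, 0.3]
    labels = [1, 1, 0, 0, 0, 0, 1, 0]
    ece, rows = calibration_report(confidences, labels)
    print(f"ECE = {ece:.3f}")
    for low, high, count, mean_conf, acc in rows:
        print(f"[{low:.1f}, {high:.1f}) n={count} "
              f"conf={mean_conf:.2f} acc={acc:.2f}")
```

A lower ECE means the reported confidences track observed correctness more closely; the per-bin rows show where in the probability range any remaining gap sits.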
Conclusion
Our proposed methods offer a practical solution for eliciting well-calibrated confidence scores in ACR tasks. By enhancing the trustworthiness of LLM outputs, we aim to streamline the usage of these imperfect models, ultimately benefiting developers and improving productivity.
Key Takeaways
- LLMs are powerful but can produce incorrect outputs, impacting developer productivity.
- Calibrated confidence scores help users make informed decisions regarding model outputs.
- Conventional global Platt-scaling may be inadequate for Automated Code Revision tasks.
- Fine-grained confidence scores yield lower calibration error across a broader range of probability intervals.
- Experiments across three ACR tasks and 14 models of various sizes support the proposed methods.
