Signal Reshaping for GRPO to Boost Weak-Feedback Code Repair

Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair

Abstract: Code-agent reinforcement learning (RL) frequently encounters weak feedback, where rollout-time signals are reliable and executable yet capture only necessary or superficial conditions for task success. The result is a mismatch between the feedback the policy receives and the semantic predicate that training is meant to optimize. In this study, we use agentic compile-fix as our experimental setting and investigate signal reshaping for standard Group Relative Policy Optimization (GRPO) under weak feedback.

Our central proposition is that GRPO’s within-group comparison becomes meaningful only after three types of signals are reshaped: outcome rewards, so that they recover semantic ranking; process signals, so that they localize intra-trajectory credit; and rollouts from the same prompt, so that they remain execution-comparable. We develop a minimal signal-reshaping framework that preserves GRPO’s group-normalized advantage construction while making that comparison more informative.
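For readers unfamiliar with the group-normalized advantage mentioned above, the following is a minimal sketch of the usual GRPO construction: each rollout's reward is centered on the group mean and divided by the group standard deviation. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; implementations differ on these details, and this is not the authors' code.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Return (r_i - group mean) / (group std + eps) for one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some code uses sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts sampled from the same prompt, with hypothetical outcome rewards.
advantages = group_normalized_advantages([0.0, 0.0, 1.0, 1.0])

# A group in which every rollout gets the same reward carries no learning signal:
# all advantages collapse to zero.
flat = group_normalized_advantages([1.0, 1.0, 1.0])
```

Because the advantage depends only on how rollouts rank against their own group, anything that distorts that ranking (an uninformative reward, or rollouts that are not comparable) distorts the update directly.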

Key Components of Signal Reshaping

Compile-and-Semantic Layered Rewards: These rewards reshape trajectory ranking so that a rollout is scored on both compile-time and semantic correctness, rather than on compilation alone.
Step-Level Process Scores: These scores are normalized outside of group reward normalization and reshape the within-trajectory update strength, so that credit is concentrated on the steps that matter within a trajectory.
Failure-Cause-Aware Rollout Governance: This mechanism reshapes within-group comparability, so that trajectories sampled from the same prompt can be compared on a meaningful basis even under weak feedback.
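The three components above can be sketched roughly as follows. Everything concrete here is hypothetical and chosen only for illustration: the tier values (0.0, 0.5, 1.0), the failure-cause labels, the `Rollout` fields, and the mean-one rescaling of step scores are not taken from the paper, which this summary does not describe at that level of detail. How the step weights should interact with negative advantages is also left open here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional

# Hypothetical tier values; the source does not state the actual numbers.
REWARD_FAIL = 0.0
REWARD_COMPILE_ONLY = 0.5
REWARD_COMPILE_AND_SEMANTIC = 1.0

# Hypothetical causes treated as environment faults rather than policy faults.
ENVIRONMENT_FAILURES = {"sandbox_timeout", "tool_crash"}


@dataclass
class Rollout:
    compiles: bool
    semantic_ok: bool
    step_scores: List[float]  # one process score per step, assumed in [0, 1]
    failure_cause: Optional[str] = None


def layered_reward(rollout: Rollout) -> float:
    """Three-tier outcome reward: fail < compile-only < compile-and-semantic."""
    if not rollout.compiles:
        return REWARD_FAIL
    if rollout.semantic_ok:
        return REWARD_COMPILE_AND_SEMANTIC
    return REWARD_COMPILE_ONLY


def step_weights(step_scores: List[float], eps: float = 1e-6) -> List[float]:
    """Rescale step scores to average about 1 within one trajectory.

    This uses only the trajectory's own scores, so it stays separate from
    the group-level reward normalization.
    """
    if not step_scores:
        return []
    avg = mean(step_scores)
    return [s / (avg + eps) for s in step_scores]


def comparable(rollouts: List[Rollout]) -> List[Rollout]:
    """Drop rollouts that failed for environment reasons before group comparison."""
    return [r for r in rollouts if r.failure_cause not in ENVIRONMENT_FAILURES]


group = [
    Rollout(True, True, [0.2, 0.8]),
    Rollout(True, False, [0.5, 0.5]),
    Rollout(False, False, [0.1, 0.1], failure_cause="sandbox_timeout"),
    Rollout(False, False, [0.3, 0.1], failure_cause="syntax_error"),
]
kept = comparable(group)
rewards = [layered_reward(r) for r in kept]
weights = step_weights(kept[0].step_scores)
```

In this sketch the timed-out rollout is excluded before rewards are compared, the remaining three rollouts land on three distinct tiers, and the per-step weights would scale a trajectory's advantage step by step.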

Experimental Results

Our experiments show a substantial end-to-end improvement. The fully signal-reshaped GRPO model raised strict compile-and-semantic accuracy from the base model’s zero-shot score of 0.385 to 0.535, an absolute gain of 0.150.

Controlled comparisons identify the sources of this gain. Binary rewards eliminate the compile-only middle tier, which can degrade trajectory control. With layered rewards in place, adding process-score weighting raises accuracy further, from 0.48 to 0.53, while reducing the average number of evaluation steps from 23.50 to 17.02.
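To make the middle-tier point concrete, the sketch below compares group-normalized advantages under a binary reward and under a three-tier reward for the same group of four rollouts. The group composition and the tier value of 0.5 are hypothetical, and the normalization is the common mean-and-standard-deviation form rather than anything confirmed by the source.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """Center on the group mean and divide by the group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts, in this order:
# full success, compile-only, failure, failure.
binary_rewards = [1.0, 0.0, 0.0, 0.0]   # compile-only scored like a failure
layered_rewards = [1.0, 0.5, 0.0, 0.0]  # hypothetical middle tier of 0.5

binary_adv = normalize(binary_rewards)
layered_adv = normalize(layered_rewards)

for name, adv in (("binary", binary_adv), ("layered", layered_adv)):
    print(name, [round(a, 3) for a in adv])
```

Under the binary reward the compile-only rollout receives exactly the same negative advantage as the two failures, so the policy gets no credit for partial progress. Under the layered reward its advantage sits strictly between the full success and the failures.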

Boundary Comparisons and Implications

As a boundary comparison, we examined privileged-prompt token-level distillation, which primarily optimizes local distributional alignment. In long tool-use trajectories, this signal is diluted by non-critical tokens and does not substitute for outcome semantics, process credit, or within-group comparability.

Our findings highlight the importance of reshaping signals when training code-agent RL systems under weak feedback: in the compile-fix setting, the gains come from reshaping the reward, the process credit, and the comparability of rollouts, while GRPO’s advantage construction is left unchanged.

These results may also be relevant to other agentic tasks in which rollout-time feedback is executable but only partially reflects task success.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
