Execution-Verified Reinforcement Learning for Optimization Modeling
Summary (arXiv:2604.00442v1)
Large language models (LLMs) are increasingly used to automate optimization modeling, but current approaches have two main limitations. Agentic pipelines built around closed-source LLMs suffer from high inference latency. Fine-tuning smaller LLMs avoids that cost but typically requires expensive process supervision and risks overfitting to a specific solver API.
In response to these challenges, we introduce Execution-Verified Optimization Modeling (EVOM), a framework based on reinforcement learning with verifiable rewards. EVOM treats a mathematical programming solver as a deterministic, interactive verifier: the result of running the generated model, rather than annotated intermediate steps, supplies the training signal.
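As a minimal sketch of what using execution as a verifier could look like, the snippet below runs a candidate program in a separate process and maps the outcome to a scalar reward. Everything specific here is an assumption made for illustration and is not taken from the paper: the `OBJECTIVE:` output convention, the binary 0/1 reward, the relative tolerance, and the function names. A subprocess with a timeout is also only a stand-in for a real sandbox, since it does not restrict file or network access.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical convention: the generated program prints "OBJECTIVE: <value>".
OBJ_PREFIX = "OBJECTIVE:"


def run_candidate(code: str, timeout_s: float = 10.0):
    """Run generated code in a separate Python process.

    Returns (ok, stdout), where ok is False on a crash or a timeout.
    """
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(code)
        try:
            proc = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                text=True,
                timeout=timeout_s,
                cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False, ""
    return proc.returncode == 0, proc.stdout


def parse_objective(stdout: str):
    """Return the last reported objective value, or None if absent or malformed."""
    for line in reversed(stdout.splitlines()):
        if line.startswith(OBJ_PREFIX):
            try:
                return float(line[len(OBJ_PREFIX):])
            except ValueError:
                return None
    return None


def execution_reward(code: str, reference: float,
                     rel_tol: float = 1e-4, timeout_s: float = 10.0) -> float:
    """Assumed binary reward: 1.0 if the reported objective matches the reference."""
    ok, stdout = run_candidate(code, timeout_s)
    if not ok:
        return 0.0
    value = parse_objective(stdout)
    if value is None:
        return 0.0
    tolerance = rel_tol * max(1.0, abs(reference))
    return 1.0 if abs(value - reference) <= tolerance else 0.0
```

In a real setup the candidate would be solver code (for example a Gurobi or OR-Tools model) and the reference would be the known optimal objective for the problem. The reward function itself never inspects how the model was built, only what the run reports.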
Key Features of EVOM
- Solver-Specific Code Generation: EVOM generates code tailored to specific solvers based on natural-language problem descriptions.
- Sandboxed Execution: The generated code is executed within a controlled environment, ensuring safety and reliability.
- Scalar Reward Conversion: Execution outcomes are transformed into scalar rewards, which are crucial for the reinforcement learning process.
- Closed-Loop Optimization: The framework runs a closed-loop generate-execute-feedback-update process, optimized with Group Relative Policy Optimization (GRPO) and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO).
- Outcome-Only Formulation: Because rewards depend only on execution outcomes, no process-level supervision is required.
- Cross-Solver Generalization: Switching the verification environment lets EVOM generalize across solvers without reconstructing solver-specific datasets.
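The update step of the closed loop can be illustrated with the group-relative advantage used in GRPO-style training: several candidate programs are sampled for the same problem, each receives a scalar reward, and rewards are normalized within that group. The sketch below shows only this normalization. The function names and the `eps` constant are illustrative assumptions, and the second helper is a simplified stand-in for the kind of filtering DAPO's dynamic sampling performs, not the paper's implementation.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_informative_group(rewards):
    """Simplified filter: a group where every sample got the same reward
    yields all-zero advantages and therefore no gradient signal."""
    return max(rewards) != min(rewards)
```

With binary execution rewards, a group in which every candidate fails (or every candidate succeeds) produces zero advantages, which is why skipping or resampling such groups can be useful.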
Experimental Validation
Experiments on NL4OPT, MAMO, IndustryOR, and OptiBench, run across the Gurobi, OR-Tools, and COPT solver backends, show that EVOM matches or outperforms process-supervised SFT (supervised fine-tuning).
EVOM also supports zero-shot solver transfer: a model trained against one solver backend can be applied to another without retraining on solver-specific data. This matters in practice, where deployments often have to work with whichever solver is available.
Conclusion
Execution-Verified Optimization Modeling replaces costly process supervision with rewards derived from solver execution. By combining reinforcement learning with a solver-based verifier, EVOM offers a path toward scalable decision intelligence, with practical relevance for industries where optimization plays a critical role.
