CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference
arXiv:2604.12913v1
Abstract
Binary decompilation is a critical reverse engineering task aimed at reconstructing high-level source code from stripped executables. Although Large Language Models (LLMs) have recently shown promise, they often suffer from “logical hallucinations” and “semantic misalignment” due to the irreversible semantic loss during compilation, resulting in generated code that fails to re-execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe-R), a lightweight two-stage code refinement framework.
Framework Overview
The CoDe-R framework addresses these decompilation challenges in two stages:
- Semantic Cognitive Enhancement (SCE): This stage implements a Rationale-Guided Semantic Injection strategy, training the model to recover high-level algorithmic intent alongside the code. By emphasizing the rationale behind code constructs, SCE enhances the model’s understanding and generation of semantically accurate code.
- Dynamic Dual-Path Fallback (DDPF): The second stage adaptively balances semantic recovery and syntactic stability during inference. It does so through a hybrid verification strategy, intended to keep the generated output both functionally correct and syntactically valid.
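The fallback idea behind DDPF can be illustrated with a minimal sketch. The summary does not specify how the two paths or the hybrid verification are implemented, so everything here is an assumption made for illustration: the names `dual_path_select`, `Selection`, and `balanced_braces`, the choice of a static check followed by an optional dynamic check, and the policy of returning the fallback candidate when nothing verifies. None of these names come from the CoDe-R paper or repository.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical type alias: a verifier takes candidate source code and
# returns True if the candidate passes that check.
Verifier = Callable[[str], bool]


@dataclass
class Selection:
    code: str
    path: str  # "semantic", "fallback", or "unverified"


def balanced_braces(code: str) -> bool:
    """Toy static check (an assumption): braces must be balanced.

    A real pipeline would more likely try to compile the candidate.
    """
    depth = 0
    for ch in code:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def dual_path_select(
    semantic_candidate: str,
    fallback_candidate: str,
    static_check: Verifier,
    dynamic_check: Optional[Verifier] = None,
) -> Selection:
    """Prefer the semantically richer candidate, but only if it verifies."""

    def passes(code: str) -> bool:
        # Hybrid verification in this sketch: static check first, then an
        # optional dynamic check (for example, running test cases).
        if not static_check(code):
            return False
        return dynamic_check(code) if dynamic_check is not None else True

    if passes(semantic_candidate):
        return Selection(semantic_candidate, "semantic")
    if passes(fallback_candidate):
        return Selection(fallback_candidate, "fallback")
    # Neither candidate verified: return the fallback, flagged as unverified.
    return Selection(fallback_candidate, "unverified")


if __name__ == "__main__":
    stable = "int f(int a) { return a + 1; }"
    broken = "int f(int a) { return a + 1;"  # missing closing brace
    print(dual_path_select(broken, stable, balanced_braces).path)
```

In this sketch the semantic candidate wins whenever it verifies, and the more conservative candidate is used only when it does not, which is one plausible reading of "balancing semantic recovery and syntactic stability".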
Performance Evaluation
CoDe-R was evaluated on the HumanEval-Decompile benchmark, where it sets a new State-of-the-Art (SOTA) in the lightweight regime. Notably, using a backbone of 1.3 billion parameters, CoDe-R is reported to be the first model at that scale to exceed an Average Re-executability Rate of 50.00%. This narrows the gap between efficient models and expert-level performance.
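To make the metric concrete, the following sketch computes an average re-executability rate from pass/fail outcomes. It assumes that a sample counts as re-executable when the refined code recompiles and passes its test cases, and that the average is taken over groups of results such as compiler optimization levels. The summary does not spell out either detail, so both are assumptions, as are the function names and the toy data, which are not results from the paper.

```python
from statistics import mean
from typing import Dict, List


def re_executability_rate(passed: List[bool]) -> float:
    """Percentage of samples whose refined code re-executes successfully."""
    if not passed:
        raise ValueError("no results to score")
    return 100.0 * sum(passed) / len(passed)


def average_re_executability(results: Dict[str, List[bool]]) -> float:
    """Unweighted mean of the per-group rates (groups are an assumption)."""
    return mean(re_executability_rate(group) for group in results.values())


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    toy_results = {
        "O0": [True, True, False, True],
        "O1": [True, False, False, True],
    }
    print(average_re_executability(toy_results))
```

Under this definition, exceeding 50.00% means that, on average across groups, more than half of the refined functions can be recompiled and pass their tests.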
Significance of the Study
CoDe-R has practical implications for reverse engineering and software analysis. By addressing logical hallucinations and semantic misalignment, it aims to make decompiled output more reliable and accurate. This improves the usability of decompilation tools and helps developers and security analysts analyze and understand binary code. Integrating LLMs into this process in a way that preserves re-executability is a step towards more capable software analysis methodologies.
Future Directions
As the landscape of binary analysis continues to evolve, further research is needed to refine and enhance the capabilities of frameworks like CoDe-R. Potential future directions could include:
- Exploration of larger model backbones for improved performance.
- Incorporation of additional contextual information to enrich the semantic understanding of the decompiled code.
- Expanding the evaluation benchmarks to encompass a wider variety of programming languages and execution environments.
Conclusion
CoDe-R represents a significant leap forward in the domain of binary decompilation, effectively leveraging the strengths of Large Language Models while overcoming their inherent limitations. As the demand for robust and reliable software analysis tools grows, innovations like CoDe-R will play a crucial role in shaping the future of reverse engineering.
For those interested in exploring the implementation of CoDe-R, the code is available at https://github.com/Theaoi/CoDe-R.
