From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
Multimodal large language models (MLLMs) have gained traction for their ability to translate visual artifacts into functional code, with applications ranging from converting UI mockups into HTML to generating Python scripts from scientific plots. Circuit diagrams pose a harder challenge, because they serve as a visual domain-specific language for hardware design. These diagrams encode timing, topology, and bit-level semantics: details that are easy to overlook but vital for safety once the design is fabricated into silicon.
Translating circuit diagrams into register-transfer-level (RTL) code is a rigorous reliability test for vision-to-code generation systems. A recent study identified a troubling phenomenon it calls “Mirage,” which exposes a significant flaw in certain MLLMs. Specifically, the researchers found that when a circuit diagram was replaced with a blank image, the models’ Pass@k scores remained unchanged or even improved. This suggests that the models bypass the visual input entirely and instead rely on the semantics of identifiers in the module header to retrieve canonical RTL templates. Such behavior undermines the models’ reliability and their trustworthiness in practical applications.
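To make the blank-image comparison concrete, here is a minimal sketch. It assumes the widely used unbiased Pass@k estimator, 1 - C(n-c, k)/C(n, k), where n samples are generated per problem and c of them pass; the summary above does not say which variant the study uses. The `mirage_gap` helper is a hypothetical name introduced here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one problem: n samples drawn, c correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mirage_gap(score_with_diagram: float, score_with_blank: float) -> float:
    """Hypothetical helper: how much the score drops when the diagram is blanked.

    A gap near zero (or negative) hints that the image is not being used.
    """
    return score_with_diagram - score_with_blank
```

Under this reading, a model that truly depends on the diagram should show a large positive gap, while a model affected by Mirage shows a gap close to zero.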
Key Findings and Methodology
To quantify the Mirage phenomenon, the researchers developed a benchmark called C2VEVAL and used it to evaluate eight MLLMs under a paired Normal/Anony protocol. In Anony mode, all identifiers in both the circuit diagram and the module header are anonymized, so names no longer hint at the circuit's function. Scores in Anony mode dropped sharply across all models, confirming that the high accuracy observed in Normal mode can be misleading and is largely a product of the Mirage effect.
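The header half of such an anonymization step might look like the sketch below. This is an illustration under stated assumptions, not the benchmark's actual tooling: the keyword set is deliberately small, the `id_N` naming scheme is invented here, and relabeling the diagram image itself is out of scope.

```python
import re
from typing import Dict, Tuple

# Small, illustrative keyword set; a real tool would need the full Verilog list.
VERILOG_KEYWORDS = {
    "module", "endmodule", "input", "output", "inout",
    "wire", "reg", "logic", "signed", "parameter",
}

# Identifiers start with a letter or underscore. The lookbehind skips the
# base letters of sized literals such as 8'hFF.
IDENT = re.compile(r"(?<![A-Za-z0-9_$'])[A-Za-z_][A-Za-z0-9_$]*")


def anonymize_header(header: str) -> Tuple[str, Dict[str, str]]:
    """Replace every non-keyword identifier with a neutral name (id_0, id_1, ...)."""
    mapping: Dict[str, str] = {}

    def rename(match: "re.Match[str]") -> str:
        name = match.group(0)
        if name in VERILOG_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"id_{len(mapping)}"
        return mapping[name]

    return IDENT.sub(rename, header), mapping


anon, names = anonymize_header(
    "module counter(input clk, input rst, output reg [7:0] count);"
)
print(anon)
```

The returned mapping lets the same neutral names be applied consistently to the labels in the diagram, which the paired protocol requires.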
Introducing VeriGround
In light of these findings, the researchers proposed a novel model called VeriGround, which is specifically designed to address the issues revealed by the Mirage phenomenon. VeriGround is trained with several innovative strategies, including:
- Identifier Anonymization: This technique trains the model to generate code without relying on the meaning of identifier names.
- Refusal Augmentation: This approach enables the model to decline requests when it cannot confidently produce accurate code.
- D-ORPO (Decision-Focused ORPO) Preference Alignment: This method up-weights pivotal generate-or-refuse tokens, enhancing the model’s decision-making capabilities.
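One way to picture the up-weighting of decision tokens is the sketch below. It is an assumption-laden illustration, not the authors' formulation: it takes the standard ORPO odds-ratio term, -log sigmoid(log odds(chosen) - log odds(rejected)), and replaces the plain length-normalized log-probability with a weighted average in which tokens flagged as pivotal count more. The `pivot_weight` parameter and the mask are hypothetical.

```python
import math
from typing import Sequence


def weighted_avg_logprob(
    token_logps: Sequence[float],
    pivotal_mask: Sequence[bool],
    pivot_weight: float = 2.0,  # hypothetical weight for generate-or-refuse tokens
) -> float:
    """Weighted mean of token log-probs; pivotal tokens get pivot_weight."""
    weights = [pivot_weight if pivotal else 1.0 for pivotal in pivotal_mask]
    total = sum(w * lp for w, lp in zip(weights, token_logps))
    return total / sum(weights)


def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    if avg_logp >= 0:
        raise ValueError("average log-probability must be negative")
    return avg_logp - math.log1p(-math.exp(avg_logp))


def odds_ratio_loss(chosen_avg_logp: float, rejected_avg_logp: float) -> float:
    """ORPO-style term: -log sigmoid(log-odds difference), computed stably."""
    x = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

With `pivot_weight` above 1, a wrong generate-or-refuse decision moves the averaged log-probability, and therefore the loss, more than an error on an ordinary token would.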
With 4 billion parameters, VeriGround achieves a Functional Pass@1 score of 46.11% in Normal mode and 42.51% in Anony mode, with False Refusal Rates of only 1.20% and 0.00%, respectively. It also refuses more than 92% of requests that supply a blank image, indicating that it can tell a meaningful diagram from an empty input.
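The metrics quoted here can be tallied from per-sample outcomes along the following lines. The sketch assumes that the False Refusal Rate is the share of valid diagrams the model refuses and that the blank-image refusal rate is the share of blank inputs it refuses; the `Outcome` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Outcome:
    has_diagram: bool     # False when the input image is blank
    refused: bool         # model declined to generate code
    passed: bool = False  # generated code passed functional tests


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty collection."""
    values = list(flags)
    return sum(values) / len(values) if values else 0.0


def summarize(outcomes: List[Outcome]) -> Dict[str, float]:
    valid = [o for o in outcomes if o.has_diagram]
    blank = [o for o in outcomes if not o.has_diagram]
    return {
        "functional_pass_at_1": rate(o.passed and not o.refused for o in valid),
        "false_refusal_rate": rate(o.refused for o in valid),
        "blank_refusal_rate": rate(o.refused for o in blank),
    }
```

A grounded model should score low on `false_refusal_rate` and high on `blank_refusal_rate` at the same time; reporting only one of the two would hide either over-refusal or Mirage-style guessing.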
Conclusion
VeriGround’s performance indicates that it can compete with larger models, such as GPT-5.4, in Normal mode, and significantly outperforms all existing baselines in Anony mode. This research not only sheds light on the hidden challenges in AI-assisted code generation but also paves the way for more reliable systems that genuinely understand visual inputs, thus enhancing trust in MLLMs for critical applications in hardware design.
