Xuanwu: Evolving General Multimodal Models into an Industrial-Grade Foundation for Content Ecosystems
arXiv:2603.29211v1
Abstract: In recent years, multimodal large models have continued to improve on general benchmarks. However, in real-world content moderation and adversarial settings, mainstream models still suffer from degraded generalization and catastrophic forgetting because of limited fine-grained visual perception and insufficient modeling of long-tail noise. In this paper, we present Xuanwu VL-2B as a case study of how general multimodal models can be developed into an industrial-grade foundation model for content ecosystems.
Introduction
Multimodal large models, which jointly process data types such as text, images, and audio, have improved steadily on general benchmarks. In real-world content moderation and adversarial settings, however, they still suffer from degraded generalization and catastrophic forgetting, which the abstract attributes to limited fine-grained visual perception and insufficient modeling of long-tail noise. This article discusses Xuanwu VL-2B, a model presented as a case study in addressing these issues.
Model Architecture
Xuanwu VL-2B adopts a compact architecture comprising an InternViT-300M vision encoder, an MLP connector, and a Qwen3 1.7B language model. This design balances three concerns:
- Fine-grained visual perception
- Language-semantic alignment
- Deployment costs
All three are balanced within an approximately 2B-parameter budget, which keeps the model practical for industrial deployment.
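The "approximately 2B" figure can be sanity-checked with simple arithmetic. The sketch below is only illustrative: the encoder and language-model sizes are the nominal values implied by the component names, and the MLP connector widths (4096, 2048, 2048) are hypothetical placeholders, not figures from the paper.

```python
# Rough parameter-budget check for a vision encoder + MLP + LLM stack.
# Component sizes are nominal values read off the model names; the
# connector dimensions are hypothetical and only illustrate the arithmetic.

VISION_ENCODER_PARAMS = 300e6   # "InternViT-300M" (nominal)
LANGUAGE_MODEL_PARAMS = 1.7e9   # "Qwen3 1.7B" (nominal)


def mlp_connector_params(in_dim: int, hidden_dim: int, out_dim: int) -> int:
    """Parameter count of a two-layer MLP (weights plus biases)."""
    first_layer = in_dim * hidden_dim + hidden_dim
    second_layer = hidden_dim * out_dim + out_dim
    return first_layer + second_layer


def total_params(connector_params: int) -> float:
    """Sum of the three components."""
    return VISION_ENCODER_PARAMS + connector_params + LANGUAGE_MODEL_PARAMS


# Hypothetical connector widths, not taken from the paper.
connector = mlp_connector_params(in_dim=4096, hidden_dim=2048, out_dim=2048)
total = total_params(connector)

print(f"connector: {connector / 1e6:.1f}M parameters")
print(f"total:     {total / 1e9:.2f}B parameters")
```

Under these assumptions the connector contributes only a small fraction of the total, so the budget is dominated by the language model and the vision encoder.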
Training Methodology
To ensure that the model retains its general capabilities while specializing in specific business applications, Xuanwu employs a robust data iteration and curation mechanism. The training process follows a progressive three-stage pipeline:
- Pre-training: Initial training on a broad dataset to build foundational capabilities.
- Mid-training: Fine-tuning with more specific data to enhance performance on targeted tasks.
- Post-training: Final adjustments to optimize the model for real-world applications.
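The progressive ordering of the three stages can be sketched as a sequential pipeline in which each stage starts from the output of the previous one. This is a minimal illustration, not the authors' training code: the `Stage` class, `run_pipeline`, and the stand-in `fake_train` function are hypothetical names, and only the stage names come from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, List[str]]


@dataclass(frozen=True)
class Stage:
    name: str
    purpose: str


# Stage names follow the text; the structure around them is a placeholder.
PIPELINE: List[Stage] = [
    Stage("pre-training", "build foundational capabilities on broad data"),
    Stage("mid-training", "improve targeted tasks with more specific data"),
    Stage("post-training", "final adjustments for real-world use"),
]


def run_pipeline(state: State, train_stage: Callable[[State, Stage], State]) -> State:
    """Run the stages in order, each starting from the previous stage's output."""
    for stage in PIPELINE:
        state = train_stage(state, stage)
    return state


def fake_train(state: State, stage: Stage) -> State:
    # Stand-in for a real training job: it only records the stage name.
    return {"history": state["history"] + [stage.name]}


final_state = run_pipeline({"history": []}, fake_train)
print(final_state["history"])
```

The point of the sketch is the ordering constraint: specialization stages consume the checkpoint produced by the broader stage before them, which is where retention of general capability has to be managed.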
Performance Evaluation
Ablation studies and offline evaluations indicate that Xuanwu VL-2B outperforms InternVL 3.5 2B on general benchmarks: it achieves an average score of 67.90 across seven OpenCompass multimodal metrics, compared to 64.27 for InternVL 3.5 2B, a gain of 3.63 points. The model also records an average recall of 94.38% over seven independent business moderation tasks. In challenging adversarial OCR scenarios, it achieves a weighted overall recall of 82.82%, surpassing Gemini-2.5-Pro at 76.72% by 6.10 percentage points.
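The two recall figures use different aggregations: an average over tasks and a weighted overall value. The sketch below shows how the two can differ, assuming that "weighted" means weighting each task by its number of positive samples; the text does not state the weighting scheme, so that is an assumption. The per-task counts are invented for illustration, and only the four reported scores in the final lines come from the text.

```python
from typing import Dict, Tuple

# task name -> (true positives, total positives)
Counts = Dict[str, Tuple[int, int]]


def macro_average_recall(counts: Counts) -> float:
    """Unweighted mean of per-task recall."""
    recalls = [tp / positives for tp, positives in counts.values()]
    return sum(recalls) / len(recalls)


def weighted_overall_recall(counts: Counts) -> float:
    """Per-task recall weighted by positive count (equivalent to pooled recall)."""
    total_tp = sum(tp for tp, _ in counts.values())
    total_positives = sum(positives for _, positives in counts.values())
    return total_tp / total_positives


# Invented counts, not data from the paper.
toy_counts: Counts = {
    "task_a": (90, 100),
    "task_b": (40, 50),
    "task_c": (5, 10),
}

macro = macro_average_recall(toy_counts)
weighted = weighted_overall_recall(toy_counts)
print(f"macro average recall:    {macro:.4f}")
print(f"weighted overall recall: {weighted:.4f}")

# Gaps between the scores reported in the text.
opencompass_gap = round(67.90 - 64.27, 2)      # points
adversarial_ocr_gap = round(82.82 - 76.72, 2)  # percentage points
print(opencompass_gap, adversarial_ocr_gap)
```

In the toy data the small, hard task pulls the macro average down more than the weighted value, which is why the two aggregates should not be compared directly.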
Conclusion
The results for Xuanwu VL-2B illustrate its potential as an industrial-grade foundation for content ecosystems. By balancing business alignment, visual perception, and general capability retention within a constrained parameter budget, Xuanwu shows that a compact multimodal model can be specialized for moderation without giving up general benchmark performance. Beyond content moderation itself, the work addresses the need for models that can adapt to complex, adversarial environments.
