IndicDB — Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
arXiv:2604.13686v1
Abstract
While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages.
Introduction
IndicDB addresses this gap with a set of relational schemas sourced from open-data platforms such as the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), so that the benchmark reflects the complexity of real administrative data.
Key Features
- Diverse Database Coverage: IndicDB comprises 20 databases containing a total of 237 tables.
- Iterative Framework: We employ a three-agent framework consisting of Architect, Auditor, and Refiner to convert denormalized government data into rich relational structures.
- Structural Rigor: Our approach ensures high relational density, achieving an average of 11.85 tables per database with join depths reaching up to six.
- Task Generation: The pipeline is value-aware, difficulty-calibrated, and join-enforced, generating a total of 15,617 tasks across English, Hindi, and five other Indic languages.
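The structural figures above can be checked with a few lines of code. The following is a minimal sketch, not the authors' tooling: the `SCHEMA` dictionary is a hypothetical toy schema, and "join depth" is assumed here to mean the longest chain of foreign-key hops in an acyclic schema graph, a definition the summary does not spell out.

```python
from functools import lru_cache

# Hypothetical toy schema (not from IndicDB): each table maps to the
# tables it references through foreign keys.
SCHEMA = {
    "village": ["block"],
    "block": ["district"],
    "district": ["state"],
    "state": [],
    "scheme": [],
    "scheme_payment": ["village", "scheme"],
}


def join_depth(schema):
    """Longest chain of foreign-key hops, assuming the FK graph is acyclic."""

    @lru_cache(maxsize=None)
    def depth(table):
        refs = schema[table]
        # A table with no outgoing foreign keys ends the chain.
        return 0 if not refs else 1 + max(depth(ref) for ref in refs)

    return max(depth(table) for table in schema)


def tables_per_database(total_tables, total_databases):
    """Average relational density across the benchmark."""
    return total_tables / total_databases


print(join_depth(SCHEMA))            # scheme_payment -> village -> block -> district -> state
print(tables_per_database(237, 20))  # figures reported in the list above
```

With the reported totals, 237 tables over 20 databases gives the stated average of 11.85 tables per database.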
Evaluation
We evaluated state-of-the-art models, including DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, and Qwen3, on cross-lingual semantic parsing across all seven linguistic variants. The findings were:
- Performance drops by 9.00% when moving from English to Indic languages.
- We attribute this drop to harder schema linking, increased structural ambiguity, and limited external knowledge for Indic languages.
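To make the comparison concrete, here is a minimal sketch of how such a gap could be measured. It assumes execution accuracy as the metric (the summary does not name the metric) and reports the gap in absolute percentage points (the summary does not say whether 9.00% is absolute or relative). The table, the queries, and the function names are all hypothetical toy examples, not IndicDB data or results.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries return the same rows, ignoring row order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # an unexecutable prediction counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


def accuracy_drop_points(english_acc, indic_acc):
    """Gap in absolute percentage points (an assumption, see lead-in)."""
    return 100.0 * (english_acc - indic_acc)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE district (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
    INSERT INTO district VALUES
        (1, 'Pune', 'Maharashtra'),
        (2, 'Nashik', 'Maharashtra'),
        (3, 'Mysuru', 'Karnataka');
    """
)

gold = "SELECT name FROM district WHERE state = 'Maharashtra'"
english_pairs = [
    (gold, gold),
    ("SELECT name FROM district WHERE state = 'Maharashtra' ORDER BY id", gold),
]
# The second prediction copies the Devanagari value from the question instead
# of linking it to the stored English value, so it returns no rows.
indic_pairs = [
    (gold, gold),
    ("SELECT name FROM district WHERE state = 'महाराष्ट्र'", gold),
]

english_acc = accuracy(conn, english_pairs)
indic_acc = accuracy(conn, indic_pairs)
print(accuracy_drop_points(english_acc, indic_acc))
```

The failing toy prediction illustrates one of the causes listed above: a value written in an Indic script has to be linked to database content stored in another form before the query can return the right rows.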
Conclusion
IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL, addressing the challenges specific to Indic languages. It aims to support the development of systems that can parse natural-language questions into SQL reliably across languages.
Access and Further Information
For more details, code, and data, please visit: IndicDB Resource.
