IndicDB: Multilingual Text-to-SQL Benchmark for Indian Languages

IndicDB — Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Summary of arXiv:2604.13686v1

Abstract

While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages.

Introduction

IndicDB addresses this gap with a set of relational schemas sourced from open-data platforms such as the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP). Because the schemas derive from real government datasets, they reflect the complexity of actual administrative data.

Key Features

Diverse Database Coverage: IndicDB comprises 20 databases containing a total of 237 tables.
Iterative Framework: A three-agent framework (Architect, Auditor, and Refiner) iteratively converts denormalized government data into rich relational structures.
Structural Rigor: The resulting databases are relationally dense, averaging 11.85 tables per database (237 tables across 20 databases), with join depths of up to six.
Task Generation: The pipeline is value-aware, difficulty-calibrated, and join-enforced, generating a total of 15,617 tasks across English, Hindi, and five other Indic languages.
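
The structural figures above can be checked with a few lines of arithmetic. The sketch below recomputes the average tables per database from the reported totals and measures join depth on a toy schema. The schema, the table names, and the definition of join depth as the longest chain of foreign-key hops are assumptions made for illustration; the source does not define the term or publish its schemas here.

```python
# Figures reported in the text.
NUM_DATABASES = 20
NUM_TABLES = 237


def tables_per_database(num_tables: int, num_databases: int) -> float:
    """Average relational density: total tables divided by total databases."""
    return num_tables / num_databases


def max_join_depth(foreign_keys: list[tuple[str, str]]) -> int:
    """Longest chain of foreign-key hops (child -> parent) in an acyclic schema.

    Assumption: "join depth" is the number of joins needed to walk the
    longest such chain. The benchmark may define it differently.
    """
    parents: dict[str, list[str]] = {}
    tables: set[str] = set()
    for child, parent in foreign_keys:
        parents.setdefault(child, []).append(parent)
        tables.update((child, parent))

    cache: dict[str, int] = {}

    def depth(table: str) -> int:
        # Memoized depth: 0 for tables with no outgoing foreign keys.
        if table not in cache:
            cache[table] = max(
                (1 + depth(p) for p in parents.get(table, [])), default=0
            )
        return cache[table]

    return max((depth(t) for t in tables), default=0)


# Hypothetical administrative schema, listed as (child, parent) pairs.
toy_foreign_keys = [
    ("district", "state"),
    ("block", "district"),
    ("village", "block"),
    ("survey", "village"),
    ("survey", "scheme"),
]

density = tables_per_database(NUM_TABLES, NUM_DATABASES)
toy_depth = max_join_depth(toy_foreign_keys)
print(f"tables per database: {density:.2f}")
print(f"toy schema join depth: {toy_depth}")
```

In the toy schema, a question about surveys grouped by state has to walk survey, village, block, district, and state, which is the kind of multi-hop query a join-enforced task generator would target.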

Evaluation

We evaluated state-of-the-art models, including DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, and Qwen3, on cross-lingual semantic parsing across seven linguistic variants. The findings revealed:

A significant 9.00% performance drop when transitioning from English to Indic languages.
This drop is attributed to complex schema linking, increased structural ambiguity, and limited external knowledge for Indic languages.
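
A cross-lingual gap like this could be measured roughly as follows. This is a minimal sketch, assuming execution accuracy as the metric (a predicted query counts as correct when it returns the same rows as the gold query) and an absolute difference between English accuracy and the mean over Indic variants. The source states neither the metric nor whether the 9.00% is absolute or relative, and the tables, rows, and queries below are toy data invented for illustration.

```python
import sqlite3


def run_query(conn: sqlite3.Connection, sql: str):
    """Return result rows as a sorted list, or None if the SQL fails."""
    try:
        # Sorting makes the comparison order-insensitive (ORDER BY is ignored).
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql: str, gold_sql: str) -> bool:
    """True when the predicted query runs and returns the gold result set."""
    predicted = run_query(conn, predicted_sql)
    return predicted is not None and predicted == run_query(conn, gold_sql)


def accuracy(conn, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


def performance_drop(english_acc: float, indic_accs: dict[str, float]) -> float:
    """Absolute drop: English accuracy minus the mean Indic accuracy."""
    return english_acc - sum(indic_accs.values()) / len(indic_accs)


# Toy database (hypothetical tables and values).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state (state_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE district (
    district_id INTEGER PRIMARY KEY,
    state_id INTEGER REFERENCES state(state_id),
    name TEXT,
    population INTEGER
);
INSERT INTO state VALUES (1, 'State A'), (2, 'State B');
INSERT INTO district VALUES (1, 1, 'D1', 10), (2, 1, 'D2', 20), (3, 2, 'D3', 30);
""")

gold = ("SELECT s.name, SUM(d.population) FROM district d "
        "JOIN state s ON s.state_id = d.state_id GROUP BY s.name")
# Hypothetical model outputs for the same question asked in two languages.
pred_en = ("SELECT s.name, SUM(d.population) FROM state s "
           "JOIN district d ON d.state_id = s.state_id GROUP BY s.name")
pred_hi = "SELECT name, SUM(population) FROM district GROUP BY name"  # missed the join

en_acc = accuracy(conn, [(pred_en, gold)])
hi_acc = accuracy(conn, [(pred_hi, gold)])
print("drop:", performance_drop(en_acc, {"hi": hi_acc}))
```

The Hindi prediction here fails by grouping on the district name instead of joining to the state table, a small instance of the schema-linking errors the text cites as one cause of the drop.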

Conclusion

IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL, addressing the challenges specific to Indic languages. It is intended to support the understanding and development of AI systems that can parse and process data across diverse linguistic backgrounds.

Access and Further Information

For more details, code, and data, please visit: IndicDB Resource.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
