IndicDB: Multilingual Text-to-SQL Benchmark for Indian Languages

IndicDB — Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Source: arXiv:2604.13686v1 (cross-listed)

Abstract

While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages.

Introduction

IndicDB addresses this gap with a set of relational schemas sourced from open-data platforms such as the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), so the benchmark reflects the complexity of real administrative data.

Key Features

  • Diverse Database Coverage: IndicDB comprises 20 databases containing a total of 237 tables.
  • Iterative Framework: We employ a three-agent framework consisting of Architect, Auditor, and Refiner to convert denormalized government data into rich relational structures.
  • Structural Rigor: Our approach ensures high relational density, achieving an average of 11.85 tables per database with join depths reaching up to six.
  • Task Generation: The pipeline is value-aware, difficulty-calibrated, and join-enforced, generating a total of 15,617 tasks across English, Hindi, and five other Indic languages.
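To make the idea of turning denormalized data into a relational structure concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is not the Architect/Auditor/Refiner pipeline itself, which this summary does not detail. The rows, the table names (state, district, school_count), and the helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical denormalized rows (not from IndicDB): state and district
# names repeat on every row, as in a flat open-data extract.
FLAT_ROWS = [
    ("Bihar", "Patna", 2021, 120),
    ("Bihar", "Gaya", 2021, 80),
    ("Kerala", "Kollam", 2021, 95),
    ("Kerala", "Kollam", 2022, 101),
]


def build_normalized_db(rows):
    """Split flat rows into state, district, and yearly fact tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE state (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE district (
            id INTEGER PRIMARY KEY,
            name TEXT,
            state_id INTEGER REFERENCES state(id),
            UNIQUE (name, state_id)
        );
        CREATE TABLE school_count (
            district_id INTEGER REFERENCES district(id),
            year INTEGER,
            schools INTEGER
        );
        """
    )
    for state, district, year, schools in rows:
        # INSERT OR IGNORE skips names already stored, removing the repetition.
        conn.execute("INSERT OR IGNORE INTO state (name) VALUES (?)", (state,))
        state_id = conn.execute(
            "SELECT id FROM state WHERE name = ?", (state,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO district (name, state_id) VALUES (?, ?)",
            (district, state_id),
        )
        district_id = conn.execute(
            "SELECT id FROM district WHERE name = ? AND state_id = ?",
            (district, state_id),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO school_count VALUES (?, ?, ?)",
            (district_id, year, schools),
        )
    return conn


def schools_per_state(conn, year):
    """A two-join query: school_count -> district -> state."""
    sql = """
        SELECT s.name, SUM(c.schools)
        FROM school_count AS c
        JOIN district AS d ON d.id = c.district_id
        JOIN state AS s ON s.id = d.state_id
        WHERE c.year = ?
        GROUP BY s.name
        ORDER BY s.name
    """
    return conn.execute(sql, (year,)).fetchall()
```

Calling schools_per_state(build_normalized_db(FLAT_ROWS), 2021) answers a question such as "how many schools did each state have in 2021?" with two joins. The join depths of up to six reported above would correspond to much longer chains of this kind.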

Evaluation

We evaluated state-of-the-art models, including DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, and Qwen3, on cross-lingual semantic parsing across seven linguistic variants. The findings revealed:

  • A significant 9.00% performance drop when transitioning from English to Indic languages.
  • This drop is attributed to complex schema linking, increased structural ambiguity, and limited external knowledge for Indic languages.
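The summary does not say which metric underlies the 9.00% figure, or whether the drop is absolute (percentage points) or relative. As a hedged illustration, the sketch below assumes execution accuracy, a common Text-to-SQL metric that compares the rows returned by predicted and gold queries, and reports the drop both ways. The function names are hypothetical and are not taken from the IndicDB code.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same rows, ignoring order."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to run counts as a miss
    gold = conn.execute(gold_sql).fetchall()
    # Sort by repr so rows with NULLs or mixed types still compare safely.
    return sorted(predicted, key=repr) == sorted(gold, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose results match."""
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


def accuracy_drop(english_acc, indic_acc):
    """Return (absolute drop, relative drop) from English to Indic accuracy."""
    absolute = english_acc - indic_acc
    relative = absolute / english_acc
    return absolute, relative
```

Under these assumptions, a model scoring 0.60 on English questions and 0.51 on their Indic counterparts would show an absolute drop of about 9 points and a relative drop of about 15%, which is why it matters to know which of the two a reported figure refers to.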

Conclusion

IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL capabilities, addressing the unique challenges posed by Indic languages. It aims to enhance the understanding and development of AI systems that can effectively parse and process data across diverse linguistic backgrounds.

Access and Further Information

For more details, code, and data, please visit: IndicDB Resource.

Lazarus Omolua (https://richlyai.com/blog)