INDOTABVQA: Cross-Lingual Table VQA Benchmark for Bahasa

INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Researchers have introduced INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. It targets a gap in current evaluation: how well Vision-Language Models (VLMs) understand complex table structures when the document and the question may be in different languages.

Overview of INDOTABVQA

INDOTABVQA comprises 1,593 document images in three visual styles: bordered, borderless, and colorful. Each image contains one or more tables, and the dataset provides 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This multilingual design allows VLMs to be evaluated in two settings:

  • Monolingual settings (Bahasa documents with Bahasa questions)
  • Cross-lingual settings (Bahasa documents with questions posed in other languages)
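
The two settings above can be sketched as a small scoring routine that groups results by question language. This is a minimal illustration, not the benchmark's official evaluation code: the record keys (question_language, prediction, answer), the "id" language code, and the use of normalized exact match as the metric are all assumptions, and the toy records are made up.

```python
from collections import defaultdict

# Assumption: documents are always in Bahasa Indonesia, tagged here as "id".
DOCUMENT_LANGUAGE = "id"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def setting_for(question_language: str) -> str:
    """Map a question language onto one of the two evaluation settings."""
    return "monolingual" if question_language == DOCUMENT_LANGUAGE else "cross-lingual"


def accuracy_by_setting(records):
    """Return exact-match accuracy per setting.

    Each record is a dict with hypothetical keys:
    'question_language', 'prediction', 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        setting = setting_for(record["question_language"])
        total[setting] += 1
        if normalize(record["prediction"]) == normalize(record["answer"]):
            correct[setting] += 1
    return {setting: correct[setting] / total[setting] for setting in total}


# Toy records for illustration only; these are not benchmark data.
toy_records = [
    {"question_language": "id", "prediction": "Jakarta", "answer": "jakarta"},
    {"question_language": "en", "prediction": "12", "answer": "12"},
    {"question_language": "hi", "prediction": "2019", "answer": "2020"},
    {"question_language": "ar", "prediction": "35 ", "answer": "35"},
]

scores = accuracy_by_setting(toy_records)
```

Reporting the two settings separately keeps a strong monolingual score from hiding weaker cross-lingual results.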

Benchmarking Leading Models

The researchers benchmarked several leading open-source VLMs alongside the proprietary GPT-4o:

  • Qwen2.5-VL
  • Gemma-3
  • LLaMA-3.2
  • GPT-4o

The results revealed substantial performance gaps, particularly on structurally complex tables and on questions in low-resource languages. These gaps point to a need for training and evaluation methods that address table structure and low-resource languages directly.

Improving Performance Through Fine-Tuning

To improve accuracy, the researchers fine-tuned two models on the INDOTABVQA dataset: a compact 3B model and a 7B model fine-tuned with LoRA. Both improved:

  • 11.6% increase in accuracy for the 3B model
  • 17.8% increase in accuracy for the 7B model
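
LoRA, the method used for the 7B model, freezes a pretrained weight matrix and trains only two small low-rank factors whose product is added to it. The sketch below counts trainable parameters for a single weight matrix to show why this is cheaper than updating the full matrix. The layer size and rank are hypothetical values chosen for illustration; the article does not report the configuration the researchers used.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Number of entries in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Entries in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, not taken from the study.
d_out, d_in, rank = 4096, 4096, 16

full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
fraction = lora / full

print(f"full matrix: {full} parameters")
print(f"LoRA factors: {lora} parameters ({fraction:.2%} of the full matrix)")
```

With these assumed values, the low-rank factors hold under 1% of the parameters of the matrix they adapt.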

The study also found that providing explicit table region coordinates as additional input improves performance by a further 4-7%. This suggests that spatial priors help VLMs reason over tables.
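
One way to pass table regions to a model is to write them into the text prompt next to the question. The sketch below shows that idea only. The TableRegion type, the pixel bounding-box format, and the prompt wording are hypothetical, since the article does not describe how the researchers encoded the coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRegion:
    """Axis-aligned bounding box of a table in pixel coordinates (hypothetical format)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def build_prompt(question: str, regions: list[TableRegion]) -> str:
    """List each table region on its own line, followed by the question."""
    lines = []
    for index, region in enumerate(regions, start=1):
        box = (region.x_min, region.y_min, region.x_max, region.y_max)
        lines.append(f"Table {index} region: {box}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Made-up coordinates and question, for illustration only.
prompt = build_prompt(
    "Berapa total penduduk?",
    [TableRegion(40, 120, 980, 640)],
)
print(prompt)
```

The document image itself would still be sent to the model; the coordinates only tell it where to look.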

Significance of INDOTABVQA

INDOTABVQA is intended as a resource for research on cross-lingual, structure-aware document understanding, particularly for underrepresented languages and regions. It encourages the development of more robust models that can handle diverse languages and complex table structures.

The full dataset is available on Hugging Face.


Lazarus Omolua
https://richlyai.com/blog