INDOTABVQA: Cross-Lingual Table VQA Benchmark for Bahasa

INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Researchers have introduced INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The benchmark tests how well Vision-Language Models (VLMs) understand complex table structures when the document and the question may be in different languages.

Overview of INDOTABVQA

INDOTABVQA comprises 1,593 document images in three visual styles: bordered, borderless, and colorful. Each image contains one or more tables. The images are paired with 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This multilingual design allows VLMs to be evaluated in two settings:

Monolingual settings (Bahasa documents with Bahasa questions)
Cross-lingual settings (Bahasa documents with questions posed in other languages)
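
As a rough illustration of the two settings, the sketch below labels a question by its language. The language codes and the function name evaluation_setting are assumptions made for this example, not identifiers taken from the dataset.

```python
# Hypothetical language codes; the dataset's actual field values may differ.
DOCUMENT_LANGUAGE = "id"  # every document image is in Bahasa Indonesia
QUESTION_LANGUAGES = {"id", "en", "hi", "ar"}


def evaluation_setting(question_language: str) -> str:
    """Return 'monolingual' or 'cross-lingual' for a given question language."""
    if question_language not in QUESTION_LANGUAGES:
        raise ValueError(f"unsupported question language: {question_language}")
    if question_language == DOCUMENT_LANGUAGE:
        return "monolingual"
    return "cross-lingual"
```

Under these assumptions, a Bahasa question maps to the monolingual setting, while English, Hindi, and Arabic questions map to the cross-lingual setting.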

Benchmarking Leading Models

The researchers benchmarked several leading VLMs, three open-source models and the proprietary GPT-4o:

Qwen2.5-VL
Gemma-3
LLaMA-3.2
GPT-4o

The results revealed substantial performance gaps, particularly on structurally complex tables and for questions in low-resource languages. These gaps point to a need for better training and evaluation methods for table understanding.
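
Gaps of this kind are typically surfaced by breaking accuracy down per question language or per table style. The sketch below shows one way to do that. The record fields (language, style, prediction, answer) and the case-insensitive exact-match metric are assumptions for illustration; the article does not state the metric used in the study, and the sample records are made up.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Compute exact-match accuracy per group.

    records: iterable of dicts with 'prediction', 'answer', and the grouping key.
    key: name of the field to group by, for example 'language' or 'style'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # Assumed metric: exact match after trimming whitespace and lowercasing.
        if record["prediction"].strip().lower() == record["answer"].strip().lower():
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up records for illustration only; these are not benchmark results.
example = [
    {"language": "id", "style": "bordered", "prediction": "12", "answer": "12"},
    {"language": "hi", "style": "borderless", "prediction": "7", "answer": "9"},
    {"language": "hi", "style": "colorful", "prediction": "Jakarta", "answer": "jakarta"},
]
print(accuracy_by_group(example, "language"))
```

Grouping by "style" instead of "language" gives the per-style breakdown with the same function.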

Improving Performance Through Fine-Tuning

To improve accuracy, the researchers fine-tuned a compact 3B model and a 7B model with LoRA on the INDOTABVQA dataset. The reported improvements were:

11.6% increase in accuracy for the 3B model
17.8% increase in accuracy for the 7B model
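
LoRA keeps the original weights frozen and trains two small low-rank matrices per adapted layer, which is why it is practical for larger models such as the 7B one. The sketch below counts trainable parameters for a single linear layer. The layer size and rank are illustrative assumptions, not the configuration used in the study.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds for one weight matrix.

    LoRA learns B with shape (d_out, rank) and A with shape (rank, d_in),
    so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


# Illustrative values only; not taken from the paper.
d_in, d_out, rank = 4096, 4096, 16
fraction = lora_trainable_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"LoRA trains {fraction:.2%} of this layer's parameters")
```

With these assumed values the adapter holds 131,072 trainable parameters against 16,777,216 in the frozen matrix, or about 0.78% of the layer.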

The study also found that providing explicit table region coordinates as additional input improves performance by a further 4-7%. This finding underscores the value of spatial priors for table-based reasoning in VLMs.
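
One simple way to supply such coordinates is to serialize each table's bounding box into the text prompt that accompanies the image. The sketch below does this. The prompt layout, the pixel-based (x_min, y_min, x_max, y_max) box format, and the function name build_prompt are assumptions for illustration; the article does not describe the input format used in the study.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in image pixels.
BoundingBox = Tuple[int, int, int, int]


def build_prompt(question: str, table_regions: List[BoundingBox]) -> str:
    """Prepend table region coordinates to a question as plain text."""
    lines = []
    for index, (x_min, y_min, x_max, y_max) in enumerate(table_regions, start=1):
        # Reject degenerate boxes early so a bad annotation is not sent to the model.
        if x_min >= x_max or y_min >= y_max:
            raise ValueError(f"invalid box for table {index}")
        lines.append(f"Table {index} region: [{x_min}, {y_min}, {x_max}, {y_max}]")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


print(build_prompt("Berapa total penjualan?", [(40, 120, 980, 560)]))
```

The resulting string would be passed to the model together with the document image, so the question arrives with an explicit hint about where each table sits on the page.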

Significance of INDOTABVQA

INDOTABVQA is intended as a resource for advancing research in cross-lingual, structure-aware document understanding, particularly for underrepresented regions of the world. The dataset encourages the development of more robust models that can handle diverse languages and complex table structures.

For those interested in exploring the dataset further, the full collection is available on Hugging Face.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
