INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents
Researchers have introduced INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The benchmark tests how well Vision-Language Models (VLMs) understand complex table structures, including when the question is asked in a different language from the document.
Overview of INDOTABVQA
INDOTABVQA comprises 1,593 document images in three visual styles: bordered, borderless, and colorful. Each image contains one or more tables, and the benchmark provides 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This multilingual design supports evaluating VLMs in two settings:
- Monolingual settings (Bahasa documents with Bahasa questions)
- Cross-lingual settings (Bahasa documents with questions posed in other languages)
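The two settings can be illustrated with a minimal sketch that tags each question-answer record by comparing its question language to the document language. The record layout and the `question_language` field name are assumptions made for illustration, not the dataset's actual schema; the language codes are standard ISO 639-1 codes.

```python
from collections import defaultdict

# Assumption: documents are always in Bahasa Indonesia (ISO 639-1 code "id").
DOCUMENT_LANGUAGE = "id"


def evaluation_setting(question_language, document_language=DOCUMENT_LANGUAGE):
    """Return 'monolingual' if the question matches the document language,
    otherwise 'cross-lingual'."""
    if question_language == document_language:
        return "monolingual"
    return "cross-lingual"


def group_by_setting(records):
    """Group records (dicts with a hypothetical 'question_language' key)
    by evaluation setting."""
    groups = defaultdict(list)
    for record in records:
        setting = evaluation_setting(record["question_language"])
        groups[setting].append(record)
    return dict(groups)


# Hypothetical records: one question per language for the same document image.
sample_records = [
    {"image": "doc_0001.png", "question_language": "id"},
    {"image": "doc_0001.png", "question_language": "en"},
    {"image": "doc_0001.png", "question_language": "hi"},
    {"image": "doc_0001.png", "question_language": "ar"},
]
sample_groups = group_by_setting(sample_records)
```

With one question per language, one record falls in the monolingual group and the remaining three in the cross-lingual group.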
Benchmarking Leading Models
The researchers benchmarked leading open-source VLMs alongside the proprietary GPT-4o:
- Qwen2.5-VL
- Gemma-3
- LLaMA-3.2
- GPT-4o
The results revealed substantial performance gaps, particularly on structurally complex tables and in low-resource languages. These gaps point to a need for better training and evaluation methods for such cases.
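A per-language breakdown of this kind can be sketched as follows. The scoring rule here, a normalized exact match, is an assumption chosen for simplicity and may differ from the metric used in the study; the field names are hypothetical.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.strip().lower().split())


def accuracy_by_language(predictions):
    """Compute exact-match accuracy per question language.

    `predictions` is a list of dicts with hypothetical keys
    'question_language', 'predicted', and 'expected'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in predictions:
        language = item["question_language"]
        total[language] += 1
        if normalize(item["predicted"]) == normalize(item["expected"]):
            correct[language] += 1
    return {language: correct[language] / total[language] for language in total}


# Made-up model outputs, used only to exercise the function.
sample_predictions = [
    {"question_language": "id", "predicted": "Jakarta", "expected": "jakarta"},
    {"question_language": "en", "predicted": "12", "expected": "12"},
    {"question_language": "en", "predicted": "15", "expected": "51"},
]
sample_scores = accuracy_by_language(sample_predictions)
```

Reporting accuracy separately per language, rather than as one pooled number, is what makes gaps between high-resource and low-resource question languages visible.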
Improving Performance Through Fine-Tuning
To improve accuracy, the researchers fine-tuned a compact 3B model and applied LoRA fine-tuning to a 7B model on the INDOTABVQA dataset. This yielded improvements of:
- 11.6% increase in accuracy for the 3B model
- 17.8% increase in accuracy for the 7B model
Moreover, the study showed that providing explicit table region coordinates as additional input improves performance by a further 4-7%. This finding underscores the value of spatial priors for table-based reasoning in VLMs.
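One way to supply such coordinates is to add them to the text prompt that accompanies the image. The sketch below is illustrative only: the prompt wording, the `(x_min, y_min, x_max, y_max)` pixel format, and the function name are assumptions, not the format used in the study.

```python
def build_prompt(question, table_boxes=None):
    """Build a text prompt, optionally prefixed with table region coordinates.

    `table_boxes` is a list of (x_min, y_min, x_max, y_max) tuples in pixels.
    This coordinate format is assumed for illustration.
    """
    lines = []
    if table_boxes:
        lines.append("Table regions (x_min, y_min, x_max, y_max):")
        for index, (x_min, y_min, x_max, y_max) in enumerate(table_boxes, start=1):
            lines.append(f"- table {index}: ({x_min}, {y_min}, {x_max}, {y_max})")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Hypothetical example: a document image with two tables.
sample_prompt = build_prompt(
    "Berapa total penjualan pada tahun 2020?",
    table_boxes=[(40, 120, 980, 460), (40, 520, 980, 900)],
)
```

Leaving `table_boxes` empty produces the plain question prompt, so the same function covers both the baseline and the coordinate-augmented condition.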
Significance of INDOTABVQA
INDOTABVQA is intended as a resource for advancing research in cross-lingual, structure-aware document understanding, particularly for underrepresented regions of the world. It encourages the development of more robust models that can handle diverse languages and complex table structures.
The full dataset is available on Hugging Face.
