IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
arXiv:2604.19298v1 (cross-listed)
Abstract
We introduce IndiaFinBench, the first publicly available evaluation benchmark specifically designed for assessing large language model (LLM) performance on Indian financial regulatory text. Traditional financial NLP benchmarks predominantly rely on Western financial corpora, such as SEC filings, US earnings reports, and English-language financial news. This focus leaves a significant gap in the coverage of non-Western regulatory frameworks, particularly those relevant to India.
Overview of IndiaFinBench
IndiaFinBench fills this gap with 406 expert-annotated question-answer pairs derived from 192 documents published by Indian regulatory bodies, including the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). The benchmark covers four task types:
- Regulatory Interpretation: 174 items
- Numerical Reasoning: 92 items
- Contradiction Detection: 62 items
- Temporal Reasoning: 78 items
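As a quick consistency check, the four task counts can be summed and converted to shares of the benchmark. The snippet below is only an illustrative sketch; the names `task_counts` and `task_shares` are made up for this example and do not come from the IndiaFinBench repository.

```python
# Item counts per task type, copied from the list above.
task_counts = {
    "Regulatory Interpretation": 174,
    "Numerical Reasoning": 92,
    "Contradiction Detection": 62,
    "Temporal Reasoning": 78,
}

# The counts should add up to the 406 question-answer pairs in the benchmark.
total = sum(task_counts.values())

# Share of the benchmark contributed by each task type, in percent.
task_shares = {task: 100 * n / total for task, n in task_counts.items()}

for task, share in task_shares.items():
    print(f"{task}: {task_counts[task]} items ({share:.1f}%)")
print(f"Total: {total} items")
```

Regulatory interpretation makes up a little over two fifths of the items, so overall accuracy is weighted toward that task; per-task accuracy is the fairer way to compare models on the smaller categories.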
Annotation Quality
Annotation quality was validated in two ways. A model-based secondary pass achieved a kappa score of 0.918 on contradiction detection. A human inter-annotator agreement evaluation on 60 items yielded a kappa score of 0.611, with an overall agreement rate of 76.7% (46 of the 60 items). On the commonly used Landis and Koch scale, 0.918 falls in the "almost perfect" band (0.81 to 1.00) and 0.611 at the lower edge of the "substantial" band (0.61 to 0.80), so the human agreement figure supports the benchmark's reliability but leaves room for annotator disagreement on some items.
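To make the relationship between raw agreement and kappa concrete, here is a minimal sketch of Cohen's kappa for two annotators. It assumes the reported kappa values are Cohen's kappa, which the text does not state explicitly; the function names and the toy labels are hypothetical and are not taken from the benchmark.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators used a single label
    return (p_o - p_e) / (1.0 - p_e)


def implied_chance_agreement(p_o, kappa):
    """Solve kappa = (p_o - p_e) / (1 - p_e) for the chance agreement p_e."""
    return (p_o - kappa) / (1.0 - kappa)


# Toy labels (invented for illustration only).
toy_a = ["yes", "yes", "no", "no", "yes", "no"]
toy_b = ["yes", "no", "no", "no", "yes", "yes"]
toy_kappa = cohens_kappa(toy_a, toy_b)
print(f"toy kappa: {toy_kappa:.3f}")

# Chance agreement implied by the reported human figures, under the
# assumption that they follow Cohen's definition.
implied_p_e = implied_chance_agreement(p_o=0.767, kappa=0.611)
print(f"implied chance agreement: {implied_p_e:.3f}")
```

Under that assumption, the reported 76.7% agreement and kappa of 0.611 imply a chance agreement of roughly 0.40, which is why the kappa is noticeably lower than the raw agreement rate.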
Model Evaluation
We evaluated twelve models under zero-shot conditions. Accuracy ranged from 70.4% for Gemma 4 E4B to 89.7% for Gemini 2.5 Flash, a spread of 19.3 percentage points. All twelve models outperformed a non-specialist human baseline, which scored 60.0%. This result points to the potential of LLMs for understanding complex financial regulatory text.
Task Discrimination and Statistical Analysis
Numerical reasoning was the most discriminative task, with a 35.9 percentage-point spread in performance across the evaluated models. To check that differences between models are not artifacts of the item sample, we conducted bootstrap significance testing with 10,000 resamples, which identified three statistically distinct performance tiers among the models. The tiering indicates that the benchmark separates models by capability rather than by noise.
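The exact bootstrap procedure is not described here, so the following is a sketch of one common variant, a paired bootstrap over items, rather than the authors' implementation. The function `paired_bootstrap`, its parameters, and the synthetic correctness vectors are all assumptions made for illustration.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Paired bootstrap on the accuracy difference between two models.

    correct_a and correct_b hold per-item 0/1 correctness on the same items.
    Returns (ci_low, ci_high, frac_a_better), where the interval is a 95%
    percentile interval for accuracy(A) - accuracy(B).
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices with replacement; both models see the same items.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    ci_low = diffs[int(0.025 * n_resamples)]
    ci_high = diffs[int(0.975 * n_resamples) - 1]
    frac_a_better = sum(d > 0 for d in diffs) / n_resamples
    return ci_low, ci_high, frac_a_better


# Synthetic data (not benchmark results): 100 items, model A answers 90
# correctly and model B answers 70 correctly.
model_a = [1] * 90 + [0] * 10
model_b = [1] * 70 + [0] * 30

ci_low, ci_high, frac_a_better = paired_bootstrap(model_a, model_b)
print(f"95% CI for accuracy difference: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"fraction of resamples where A beats B: {frac_a_better:.3f}")
```

If the interval for a pair of models excludes zero, the two can be treated as statistically distinct; models whose pairwise intervals all include zero would fall into the same tier. How the three reported tiers were actually derived from such comparisons is not specified in the text.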
Availability
The dataset, evaluation code, and all model outputs for IndiaFinBench are publicly available at https://github.com/rajveerpall/IndiaFinBench. Researchers and practitioners in financial NLP are encouraged to use this resource to advance the understanding and application of LLMs in the context of Indian financial regulation.
