IndiaFinBench: Benchmarking LLMs on Indian Finance Texts

IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text

Source: arXiv:2604.19298v1 (cross-listed)

Abstract

We introduce IndiaFinBench, the first publicly available evaluation benchmark designed specifically to assess large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks rely predominantly on Western financial corpora such as SEC filings, US earnings reports, and English-language financial news. As a result, non-Western regulatory frameworks, including India's, remain largely uncovered.

Overview of IndiaFinBench

IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents issued by Indian regulatory bodies, including the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI). The benchmark covers four task types:

Regulatory Interpretation: 174 items
Numerical Reasoning: 92 items
Contradiction Detection: 62 items
Temporal Reasoning: 78 items
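As a quick consistency check, the four task counts can be summed and converted to shares of the benchmark. This is a small sketch that uses only the counts listed above; the variable names are illustrative and do not come from the IndiaFinBench code.

```python
# Task counts as reported in the overview above.
task_counts = {
    "Regulatory Interpretation": 174,
    "Numerical Reasoning": 92,
    "Contradiction Detection": 62,
    "Temporal Reasoning": 78,
}

# The counts should add up to the 406 question-answer pairs in the benchmark.
total = sum(task_counts.values())

# Share of the benchmark taken up by each task type, in percent.
task_shares = {task: round(100 * n / total, 1) for task, n in task_counts.items()}

for task, n in task_counts.items():
    print(f"{task:<26} {n:>4}  {task_shares[task]:>5.1f}%")
print(f"{'Total':<26} {total:>4}")
```

Regulatory interpretation makes up the largest share of items, while contradiction detection is the smallest task.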

Annotation Quality

Annotation quality was validated in two ways. A model-based secondary pass reached a kappa of 0.918 on contradiction detection. A human inter-annotator agreement evaluation on 60 items yielded a kappa of 0.611, with an overall agreement rate of 76.7%. On the commonly cited Landis and Koch scale, the first value falls in the "almost perfect" band and the second at the lower edge of the "substantial" band (0.61 to 0.80), so the human agreement figure supports the benchmark's reliability but leaves room for disagreement on individual items.
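To make the relationship between raw agreement and kappa concrete, here is a minimal sketch of Cohen's kappa for two annotators. The text does not say which kappa variant was used, so Cohen's kappa is an assumption, and the toy labels are invented for illustration only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented toy labels, not benchmark data.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
toy_kappa = cohens_kappa(annotator_1, annotator_2)

# Rearranging kappa = (p_o - p_e) / (1 - p_e) for the reported figures
# (agreement 76.7%, kappa 0.611) gives the chance agreement they imply.
reported_p_o, reported_kappa = 0.767, 0.611
implied_p_e = (reported_p_o - reported_kappa) / (1 - reported_kappa)
print(f"toy kappa = {toy_kappa:.3f}, implied chance agreement = {implied_p_e:.3f}")
```

Under that assumption, the reported figures imply a chance agreement of roughly 0.40, which is why 76.7% raw agreement translates into a noticeably lower kappa.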

Model Evaluation

We evaluated twelve models under zero-shot conditions. Accuracy ranged from 70.4% for Gemma 4 E4B to 89.7% for Gemini 2.5 Flash, a spread of 19.3 percentage points. All twelve models outperformed a non-specialist human baseline, which reached 60.0% accuracy. This suggests that current LLMs handle Indian financial regulatory text better than readers without domain expertise.
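The scoring rule behind these accuracy figures is not described here. As a rough sketch only, accuracy over a set of question-answer pairs could be computed with a normalized exact-match comparison. The `normalize` and `accuracy` helpers and the toy answers below are hypothetical and are not taken from the IndiaFinBench evaluation code.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def accuracy(predictions, references):
    """Percentage of predictions that match their reference after normalization."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and of equal length")
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Invented toy answers, not benchmark data.
references = ["Yes", "10 percent", "No", "30 days"]
predictions = ["yes", "10  Percent", "Yes", "30 days"]

score = accuracy(predictions, references)
print(f"accuracy = {score:.1f}%")
```

Exact match is a strict criterion. Free-text regulatory answers may need a more tolerant comparison, so this choice should be treated as a placeholder.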

Task Discrimination and Statistical Analysis

Numerical reasoning was the most discriminative task, with a 35.9 percentage-point spread in performance across the evaluated models. To test whether differences between models are statistically meaningful, we ran bootstrap significance testing with 10,000 resamples, which identified three statistically distinct performance tiers among the models.
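The exact bootstrap procedure is not specified here. One common variant is a paired percentile bootstrap over benchmark items, sketched below for the accuracy difference between two models. The function name, the 95% interval, and the per-item correctness vectors are assumptions made for illustration, not the authors' implementation or results.

```python
import random


def bootstrap_accuracy_diff(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Observed accuracy difference (A - B) with a 95% percentile bootstrap interval."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("correctness lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-item difference in correctness: +1, 0, or -1.
    deltas = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed = sum(deltas) / n
    # Resample items with replacement and recompute the difference each time.
    diffs = sorted(sum(rng.choices(deltas, k=n)) / n for _ in range(n_resamples))
    lower = diffs[(n_resamples * 25) // 1000]
    upper = diffs[min(n_resamples - 1, (n_resamples * 975) // 1000)]
    return observed, lower, upper


# Invented per-item results for two hypothetical models on 406 items.
model_a = [True] * 360 + [False] * 46
model_b = [True] * 290 + [False] * 116

observed, lower, upper = bootstrap_accuracy_diff(model_a, model_b)
print(f"difference = {observed:.3f}, 95% interval = [{lower:.3f}, {upper:.3f}]")
```

In this setup, an interval that excludes zero is read as a significant difference between the two models. Grouping models into tiers would then amount to placing models whose pairwise intervals include zero in the same tier.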

Availability

The dataset, evaluation code, and all model outputs for IndiaFinBench are publicly available at https://github.com/rajveerpall/IndiaFinBench. Researchers and practitioners in financial NLP can use these resources to study and improve LLM performance on Indian financial regulation.
