SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Financial natural language processing (NLP) has advanced rapidly in English, where benchmarks now cover sentiment analysis, document understanding, and financial question answering. Arabic financial NLP remains under-explored by comparison, despite strong demand for reliable financial and Islamic finance assistants in the Arabic-speaking world. SAHM is a new benchmark introduced to address this gap.
Introduction to SAHM
SAHM, short for “Shari’ah-compliant Arabic Financial NLP Benchmark,” is a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. It consists of 14,380 expert-verified instances covering seven tasks.
Key Features of SAHM
SAHM's tasks each test a different aspect of financial reasoning and understanding in Arabic. They span the following areas:
- AAOIFI standards Question Answering (QA)
- Fatwa-based QA/Multiple Choice Questions (MCQ)
- Accounting and business examinations
- Financial sentiment analysis
- Extractive summarization
- Event-cause reasoning
The instances are drawn from authentic regulatory, juristic, and corporate sources, which keeps the data relevant and reliable for researchers and developers in the field.
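To make "document-grounded" concrete, here is a minimal sketch of how a single instance might be represented in code. The article does not describe the released data format, so every field name and task identifier below is a hypothetical choice for illustration; only the task areas themselves come from the list above.

```python
from dataclasses import dataclass, field
from typing import List

# Task areas from the list above. The identifier strings are hypothetical,
# not names used by the SAHM release.
TASK_AREAS = (
    "aaoifi_qa",
    "fatwa_qa_mcq",
    "accounting_business_exam",
    "financial_sentiment",
    "extractive_summarization",
    "event_cause_reasoning",
)


@dataclass
class Instance:
    """One benchmark item tied to a source passage (assumed schema)."""

    task: str
    source_text: str  # the regulatory, juristic, or corporate passage
    question: str
    reference: str  # gold answer, label, or reference summary
    choices: List[str] = field(default_factory=list)  # empty for open-ended items

    def __post_init__(self) -> None:
        if self.task not in TASK_AREAS:
            raise ValueError(f"unknown task area: {self.task!r}")
        if self.choices and self.reference not in self.choices:
            raise ValueError("reference must be one of the choices")

    @property
    def is_multiple_choice(self) -> bool:
        return bool(self.choices)
```

Keeping the source passage on the instance itself means a scorer can check an answer against the evidence rather than against the model's general knowledge, which is the point of a document-grounded design.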
Evaluation of Language Models
The authors evaluated 19 strong open and proprietary large language models (LLMs) on SAHM, using task-specific metrics together with rubric-based scoring for open-ended outputs. The central finding is that proficiency in Arabic does not, by itself, translate into effective evidence-grounded financial reasoning.
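The two kinds of scoring mentioned here can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's evaluation code: the article does not say which task-specific metrics or rubric criteria are used, so exact-match accuracy, the criterion names, and the 0 to 5 scale are all hypothetical.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy, a plausible metric for MCQ or label tasks."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no items to score")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def rubric_score(ratings: Dict[str, int], max_per_criterion: int = 5) -> float:
    """Normalize per-criterion ratings to [0, 1] (assumed 0..max scale)."""
    if not ratings:
        raise ValueError("no rubric criteria supplied")
    for name, value in ratings.items():
        if not 0 <= value <= max_per_criterion:
            raise ValueError(f"rating for {name!r} is out of range")
    return sum(ratings.values()) / (max_per_criterion * len(ratings))


# Illustrative inputs only; these are not results from the evaluation.
mcq_accuracy = accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"])
open_ended = rubric_score({"grounding": 4, "correctness": 5, "fluency": 3})
```

Normalizing rubric ratings to the same 0 to 1 range as accuracy makes it possible to report both kinds of task side by side, although the two numbers still measure different things.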
Specifically, the models performed significantly better on recognition-style tasks than on generation and causal reasoning tasks. The largest gaps appeared in event-cause reasoning, which marks it as the area most in need of improvement.
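A gap of this kind can be quantified by averaging normalized scores within each task group and taking the difference. The sketch below assumes scores already lie in the range 0 to 1; the assignment of tasks to groups and the numbers are placeholders chosen for illustration, not figures reported for SAHM.

```python
from statistics import mean
from typing import Dict, List


def group_means(
    scores: Dict[str, float], groups: Dict[str, List[str]]
) -> Dict[str, float]:
    """Average per-task scores within each named group of tasks."""
    return {name: mean(scores[task] for task in tasks) for name, tasks in groups.items()}


# Placeholder scores for one hypothetical model; NOT results from the paper.
scores = {
    "financial_sentiment": 0.8,
    "fatwa_mcq": 0.7,
    "extractive_summarization": 0.5,
    "event_cause_reasoning": 0.3,
}

# Hypothetical grouping; the article does not spell out which tasks it
# counts as recognition-style.
groups = {
    "recognition": ["financial_sentiment", "fatwa_mcq"],
    "generation_and_causal": ["extractive_summarization", "event_cause_reasoning"],
}

means = group_means(scores, groups)
gap = means["recognition"] - means["generation_and_causal"]
```

Reporting the gap per model, rather than only the overall average, shows whether a model that scores well in aggregate is doing so mainly on the recognition-style tasks.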
Future Implications
SAHM is a step toward advancing Arabic financial NLP and supporting research in Shari’ah-compliant reasoning. By releasing the benchmark together with its evaluation framework and an instruction-tuned model, the creators aim to encourage further work in this domain.
As demand for trustworthy Arabic financial assistants grows, resources like SAHM can help close the gap between what current models can do and what users in the Arabic-speaking financial sector need.
Conclusion
SAHM gives Arabic financial NLP a structured way to evaluate and improve financial reasoning, and it exposes where current models fall short. That makes it a useful foundation for building more reliable financial tools for the Arabic-speaking world.
