BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
arXiv:2604.11304v1
Abstract: Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers.
To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables, including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI.
Key Features of BankerToolBench
BankerToolBench has four features designed to evaluate AI agents rigorously in an investment banking setting:
- Realistic Workflows: The benchmark is based on actual tasks performed by junior bankers, ensuring that the scenarios are relevant and applicable to real-world situations.
- Multi-Tool Integration: BTB requires agents to use industry-specific tools, such as a market data platform and an SEC filings database, which tests whether they can handle the varied tasks that make up investment banking work.
- Comprehensive Deliverables: Agents must produce multiple forms of outputs, including Excel models and presentation decks, which are critical for client-facing roles.
- Automated Evaluation: The benchmark includes an automated scoring system that measures deliverables against over 100 criteria defined by experienced bankers, providing a robust assessment of agent performance.
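To make the rubric-based scoring concrete, here is a minimal sketch of how a pass rate over banker-defined criteria could be computed. It assumes every criterion is a simple pass/fail check with equal weight; the summary does not say how BTB weights or aggregates its criteria, and the `Criterion` class, the `rubric_pass_rate` function, and the example criteria are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not BTB's actual schema)."""
    name: str
    passed: bool


def rubric_pass_rate(criteria):
    """Return the fraction of criteria met, assuming equal weights."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    return sum(c.passed for c in criteria) / len(criteria)


# Illustrative criteria only; a real BTB rubric has over 100 of them.
example_rubric = [
    Criterion("Balance sheet balances", True),
    Criterion("Deck figures match the model", False),
    Criterion("Data sources are cited", True),
    Criterion("Valuation range is stated", False),
]

score = rubric_pass_rate(example_rubric)
print(f"Pass rate: {score:.0%}")
```

Under this equal-weight assumption, a model that fails nearly half of the criteria would score just above 50%.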
Performance Insights
Tests of 9 frontier models, including GPT-5.4, exposed clear limits on what AI agents can currently do in professional settings:
- Subpar Performance: Even the best-performing model, GPT-5.4, failed to meet nearly half of the rubric criteria.
- Client Readiness: Bankers rated 0% of the outputs generated by the AI as client-ready, indicating a critical gap in quality.
- Failure Analysis: The analysis identified key obstacles, such as breakdowns in cross-artifact consistency, which hinder AI’s effectiveness in these workflows.
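One way to picture a cross-artifact consistency check is to compare the same figure across every deliverable that reports it. The sketch below assumes the key metrics have already been extracted from each file into plain dictionaries; the file names, metric names, values, and the `find_inconsistencies` function are hypothetical and are not taken from the benchmark.

```python
import math


def find_inconsistencies(artifacts, rel_tol=1e-6):
    """Return metrics whose values disagree across artifacts.

    `artifacts` maps an artifact name to a dict of metric name -> number.
    """
    # Regroup the values so each metric lists what every artifact reports.
    by_metric = {}
    for artifact, metrics in artifacts.items():
        for metric, value in metrics.items():
            by_metric.setdefault(metric, {})[artifact] = value

    issues = {}
    for metric, values in by_metric.items():
        numbers = list(values.values())
        reference = numbers[0]
        # Flag the metric if any artifact deviates from the first one seen.
        if any(not math.isclose(reference, v, rel_tol=rel_tol) for v in numbers[1:]):
            issues[metric] = values
    return issues


# Made-up figures: the deck reports a different EBITDA than the model.
extracted = {
    "model.xlsx": {"revenue_2025": 120.0, "ebitda_2025": 30.0},
    "deck.pptx": {"revenue_2025": 120.0, "ebitda_2025": 32.5},
}

issues = find_inconsistencies(extracted)
for metric, values in issues.items():
    print(metric, values)
```

A check like this only catches numeric mismatches; inconsistencies in narrative claims or assumptions across files would need a different approach.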
Future Directions for AI in Investment Banking
The BankerToolBench findings point to several ways to build more effective AI agents for high-stakes professional workflows:
- Enhanced Training Data: Incorporating more diverse and representative data can help improve the understanding of complex tasks.
- Focus on Consistency: Addressing issues related to cross-artifact consistency can lead to more reliable outputs.
- Collaboration with Professionals: Ongoing collaboration with industry experts can guide the development of AI solutions that truly meet the needs of investment banking.
In conclusion, BankerToolBench sets a new standard for evaluating AI agents in investment banking. Addressing the failures it exposes offers a concrete path toward agents that can be trusted with this high-value work.
