FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks
Summary: arXiv:2604.10015v1 Announce Type: new
Abstract: Recent studies demonstrate that tool-calling capability enables large language models (LLMs) to interact with external environments for long-horizon financial tasks. While existing benchmarks have begun evaluating financial tool calling, they focus on limited scenarios and rely on call-level metrics that fail to capture trajectory-level reasoning quality.
To address this gap, we introduce FinTrace, a benchmark comprising 800 expert-annotated trajectories spanning 34 real-world financial task categories across multiple difficulty levels. FinTrace employs a rubric-based evaluation protocol with nine metrics organized along four axes:
- Action correctness
- Execution efficiency
- Process quality
- Output quality
This framework enables a fine-grained assessment of LLM tool-calling behavior, which is crucial for understanding how effectively these models can operate in practical financial contexts.
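To make the axis structure concrete, here is a minimal sketch of rolling per-metric rubric scores up into the four axes named above. The axis names come from the abstract. The metric names, the 1 to 5 scale, and the choice of a plain average are hypothetical placeholders, since the abstract does not list the nine metrics or say how they are aggregated.

```python
from statistics import mean
from typing import Dict, List

# Axis names come from the abstract. The metric names under each axis are
# hypothetical placeholders, NOT FinTrace's actual nine metrics.
AXES: Dict[str, List[str]] = {
    "action_correctness": ["tool_selection", "argument_validity"],
    "execution_efficiency": ["step_economy"],
    "process_quality": ["information_utilization", "reasoning_coherence"],
    "output_quality": ["final_answer_quality"],
}


def axis_scores(rubric: Dict[str, float]) -> Dict[str, float]:
    """Average the per-metric rubric scores within each axis (assumed rule)."""
    return {
        axis: mean(rubric[metric] for metric in metrics)
        for axis, metrics in AXES.items()
    }


# One made-up trajectory on an assumed 1-5 scale:
# strong tool selection, weak use of the returned tool outputs.
example = {
    "tool_selection": 5,
    "argument_validity": 4,
    "step_economy": 4,
    "information_utilization": 2,
    "reasoning_coherence": 3,
    "final_answer_quality": 2,
}
scores = axis_scores(example)
```

In this invented example the trajectory scores 4.5 on action correctness but only 2.5 on process quality and 2 on output quality. A call-level metric that checks only which tools were invoked would miss that kind of gap.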
Our evaluation of 13 LLMs shows that leading models select tools well but remain weak in two areas:
- Information Utilization: Many models struggle to effectively use the information provided by the tools they invoke.
- Final Answer Quality: There is a notable gap between selecting the appropriate tools and producing high-quality, accurate outputs.
This discrepancy highlights a fundamental challenge in LLM tool calling: invoking the right tools does not guarantee effective reasoning over the outputs those tools return.
We also construct FinTrace-Training, the first trajectory-level preference dataset for financial tool calling. It comprises 8,196 curated trajectories with tool-augmented contexts and preference pairs. We fine-tune Qwen-3.5-9B on it in two stages:
- Supervised fine-tuning
- Direct preference optimization (DPO)
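For readers unfamiliar with the second stage, the following is a minimal sketch of the standard DPO objective for a single preference pair, here a preferred and a dispreferred trajectory for the same task. The abstract does not state which DPO variant or hyperparameters were used, so the function name, argument names, and `beta=0.1` are assumptions for illustration only.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    logp_* are summed token log-probabilities of a whole trajectory under
    the policy being trained; ref_logp_* are the same quantities under the
    frozen reference model (typically the supervised fine-tuned checkpoint).
    """
    # How much more the policy favors the chosen trajectory over the
    # rejected one, measured relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy and the reference model agree exactly, the margin is zero and the loss equals log 2. The loss falls below that as the policy shifts probability toward the preferred trajectory and rises above it when the policy favors the rejected one.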
Training on FinTrace-Training consistently improves intermediate reasoning metrics, and DPO is particularly effective at suppressing failure modes. End-to-end answer quality nevertheless remains a bottleneck: the trajectory-level improvements do not yet fully translate into better final outputs.
FinTrace thus offers a trajectory-level evaluation framework for LLMs on financial tasks that addresses the shortcomings of call-level benchmarks. Its results point to bridging the gap between trajectory-level reasoning and final output quality as a priority for future work.
