GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification
arXiv:2603.29112v1
Introduction
Understanding user preferences from interaction data remains a significant challenge. GISTBench is a benchmark for evaluating how well Large Language Models (LLMs) recognize user interests from interaction histories, particularly within recommendation systems.
Benchmark Overview
Traditional recommendation system benchmarks focus predominantly on item prediction accuracy. GISTBench instead emphasizes extracting user interests from engagement data and verifying them against that data. The goal is to measure how well LLMs interpret user behavior and preferences, which in turn bears on the personalization of recommendations.
Proposed Metric Families
To facilitate this evaluation, GISTBench introduces two novel metric families:
- Interest Groundedness (IG): This metric is decomposed into precision and recall components, allowing for a nuanced assessment that separately penalizes hallucinated interest categories while rewarding coverage.
- Interest Specificity (IS): This metric evaluates the distinctiveness of verified LLM-predicted user profiles, ensuring that the interests identified by the models are not only accurate but also uniquely representative of the users.
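The two metric families can be illustrated with a minimal sketch. It assumes interests are represented as sets of category labels, that groundedness is scored as set precision and recall against the interests supported by evidence in the user's history, and that specificity is one minus the mean Jaccard overlap with other users' verified profiles. These definitions, the function names, and the sample labels are illustrative assumptions; the benchmark's exact formulas may differ.

```python
from typing import Dict, List, Set


def groundedness(predicted: Set[str], grounded: Set[str]) -> Dict[str, float]:
    """Precision and recall of predicted interests against evidence-backed ones.

    Assumed definition: precision penalizes hallucinated categories,
    recall rewards coverage of the interests the history supports.
    """
    hits = predicted & grounded
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(grounded) if grounded else 0.0
    return {"precision": precision, "recall": recall}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two label sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def specificity(profile: Set[str], others: List[Set[str]]) -> float:
    """Assumed definition: 1 minus mean Jaccard overlap with other profiles."""
    if not others:
        return 1.0
    mean_overlap = sum(jaccard(profile, o) for o in others) / len(others)
    return 1.0 - mean_overlap


# Hypothetical example data.
predicted = {"cooking", "hiking", "chess"}                # model output
grounded = {"cooking", "hiking", "gardening", "travel"}   # supported by history

scores = groundedness(predicted, grounded)
verified = predicted & grounded                           # drop hallucinations
spec = specificity(verified, [{"cooking", "music"}, {"dance"}])
```

In this example "chess" lowers precision because nothing in the history supports it, while the missed "gardening" and "travel" lower recall. Specificity is computed only on the verified part of the profile, matching the description of IS as a measure over verified predictions.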
Dataset and Methodology
GISTBench provides a synthetic dataset constructed from real user interactions on a global short-form video platform. The dataset contains both implicit and explicit engagement signals, along with rich textual descriptions that give more context on user behavior. Its fidelity has been validated against user surveys.
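As a rough illustration of what one engagement record in such a dataset might look like, the sketch below models an event with a textual description and a flag separating explicit from implicit signals. The field names and the signal types ("watch", "like", "share") are hypothetical and are not taken from the released dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Engagement:
    """One user-item interaction (hypothetical schema)."""
    item_id: str
    description: str   # textual description of the video
    signal: str        # hypothetical types: "watch", "like", "share"
    explicit: bool     # True for deliberate actions, False for passive ones


# Hypothetical interaction history for a single user.
history: List[Engagement] = [
    Engagement("v1", "Quick pasta recipe for weeknights", "watch", False),
    Engagement("v2", "Sourdough starter basics", "like", True),
    Engagement("v3", "Trail guide for a mountain day hike", "share", True),
]

# Separate the two signal kinds so they can be weighted or shown differently.
explicit_events = [e for e in history if e.explicit]
implicit_events = [e for e in history if not e.explicit]
```

Keeping the description alongside the signal type lets an LLM prompt present both what the user interacted with and how.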
Evaluation of LLMs
Using this dataset, GISTBench was applied to evaluate eight open-weight LLMs ranging from 7 billion to 120 billion parameters. The evaluation aims to identify performance bottlenecks in current LLMs, focusing on their ability to accurately count and attribute engagement signals across diverse interaction types.
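One way to picture the counting-and-attribution task is to compare counts a model claims as evidence with counts computed directly from the history. The sketch below assumes each event is reduced to an (interest category, signal type) pair. The representation, the function names, and the sample data are illustrative assumptions rather than the benchmark's actual procedure.

```python
from collections import Counter
from typing import Dict, List, Tuple

Event = Tuple[str, str]  # (interest category, signal type), hypothetical


def count_signals(events: List[Event]) -> Counter:
    """Ground-truth count of each (category, signal type) pair."""
    return Counter(events)


def count_errors(claimed: Dict[Event, int], events: List[Event]) -> Dict[Event, int]:
    """Return claimed minus actual for every pair where the two disagree."""
    actual = count_signals(events)
    keys = set(claimed) | set(actual)
    return {
        k: claimed.get(k, 0) - actual.get(k, 0)
        for k in keys
        if claimed.get(k, 0) != actual.get(k, 0)
    }


# Hypothetical history and hypothetical counts cited by a model.
events = [
    ("cooking", "like"),
    ("cooking", "watch"),
    ("cooking", "watch"),
    ("hiking", "share"),
]
claimed = {("cooking", "watch"): 3, ("hiking", "share"): 1, ("chess", "like"): 1}

errors = count_errors(claimed, events)
```

A positive value marks an overcount or an unsupported signal, such as the "chess" like that never occurred. A negative value marks evidence the model missed.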
Key Findings
The preliminary findings from the GISTBench evaluations reveal significant challenges for existing LLMs. Many models show limited ability to count and attribute engagement signals accurately, which hinders their understanding of user preferences. This points to a need for further research on LLM design and training to improve performance in real-world applications.
Conclusion
GISTBench shifts the assessment of LLMs in recommendation settings towards user understanding and interest verification. With its two metric families and a dataset validated against user surveys, the benchmark provides a basis for future research on improving the personalization capabilities of LLM-powered recommendation systems. Its findings highlight current limitations and indicate where models need to improve.
