CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
Researchers have introduced CheeseBench, a benchmark that evaluates large language models (LLMs) on classical behavioral neuroscience paradigms. The study, described in the paper arXiv:2604.10825v1, probes the cognitive abilities of LLMs by simulating tasks traditionally used in rodent behavioral studies.
About CheeseBench
CheeseBench comprises nine well-known behavioral neuroscience paradigms:
- Morris water maze
- Barnes maze
- T-maze
- Radial arm maze
- Star maze
- Operant chamber
- Shuttle box
- Conditioned place preference
- Delayed non-match to sample
These paradigms cover six distinct cognitive dimensions, allowing for a comprehensive assessment of LLMs. Each task is rooted in peer-reviewed rodent protocols, with approximate animal baselines provided for comparison.
Methodology
The benchmark gives agents no description of the task. Each agent receives a unified system prompt devoid of task-specific instructions, so it must discover the goal solely from ASCII text observations and reward signals. This setup mirrors the experience of a rodent placed in an unfamiliar environment and tests the model’s ability to learn and adapt without prior guidance.
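The interaction described here can be sketched as a simple observe-act-reward loop. The following is a minimal illustration, not the benchmark's actual code: the `ToyCorridor` environment, its ASCII symbols, the action names, and the `random_policy` placeholder (which stands in for an LLM call) are all assumptions made for this example.

```python
import random


class ToyCorridor:
    """Hypothetical stand-in environment: a 1-D corridor with an unmarked goal at the right end."""

    ACTIONS = ("left", "right")

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos = 0
        self.steps = 0
        return self._render()

    def _render(self):
        # ASCII observation: '@' is the agent, '.' is an empty cell, '#' is a wall.
        cells = ["."] * self.length
        cells[self.pos] = "@"
        return "#" + "".join(cells) + "#"

    def step(self, action):
        if action == "right":
            self.pos = min(self.pos + 1, self.length - 1)
        elif action == "left":
            self.pos = max(self.pos - 1, 0)
        self.steps += 1
        reached = self.pos == self.length - 1
        reward = 1.0 if reached else 0.0
        done = reached or self.steps >= self.max_steps
        return self._render(), reward, done


def random_policy(observation, last_reward, rng):
    """Placeholder for an LLM call: ignores its inputs and picks a random action."""
    return rng.choice(ToyCorridor.ACTIONS)


def run_episode(env, policy, rng):
    """Run one episode; the policy sees only the ASCII observation and the last reward."""
    obs = env.reset()
    reward, done, success = 0.0, False, False
    while not done:
        action = policy(obs, reward, rng)
        obs, reward, done = env.step(action)
        success = success or reward > 0
    return success


if __name__ == "__main__":
    print(run_episode(ToyCorridor(), random_policy, random.Random(0)))
```

Nothing in the loop tells the policy where the goal is or what counts as success; the only feedback is the reward returned by `step`, which is the property the benchmark's protocol relies on.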
Model Evaluation
The study evaluates six open-weight LLMs ranging from 3 billion to 72 billion parameters. The models are tested on text-based ASCII renderings, and their results are compared against both a random baseline and a graph-based reinforcement learning agent. The best-performing model was Qwen2.5-VL-7B, with an average success rate of 52.6% on the ASCII input tasks.
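To make the notion of an average success rate concrete, the sketch below computes a per-task success rate and then an unweighted mean across tasks for an agent and a baseline. Whether the paper averages this way is an assumption, as are the function names and task keys; the episode outcomes are made up for illustration and are not results from the study.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of successful episodes in a list of booleans."""
    return sum(outcomes) / len(outcomes)


def average_success(per_task_outcomes):
    """Unweighted mean of per-task success rates (every task counts equally)."""
    return mean(success_rate(o) for o in per_task_outcomes.values())


# Made-up episode outcomes for illustration only; NOT data from the paper.
llm_agent = {
    "t_maze": [True, False, True, True],
    "barnes_maze": [False, False, True, False],
}
random_baseline = {
    "t_maze": [True, False, False, False],
    "barnes_maze": [False, False, False, False],
}

if __name__ == "__main__":
    for name, runs in [("llm_agent", llm_agent), ("random_baseline", random_baseline)]:
        print(f"{name}: {average_success(runs):.1%}")
```

Averaging per task first keeps a task with many episodes from dominating the overall score; pooling all episodes would be an equally plausible choice and can give a different number.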
Key Findings
The research yielded several critical insights:
- Scaling beyond 7 billion parameters yields diminishing returns.
- Longer context history appears to degrade performance.
- Employing chain-of-thought prompting is counterproductive.
- A vision-language architecture provides advantages at 7 billion parameters but proves detrimental at 32 billion parameters.
The same model's performance ranged from 20% to 57% depending solely on interface parameters. This highlights that what is being evaluated is the agent-plus-interface system rather than the model in isolation.
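One way to expose this kind of interface sensitivity is to hold the model fixed and sweep the interface settings. The sketch below assumes two settings suggested by the findings above, context history length and chain-of-thought prompting; the function names, the parameter names, and the `toy_evaluate` scoring rule are hypothetical and invented for this example.

```python
from itertools import product


def sweep_interface(evaluate, history_lengths, cot_options):
    """Evaluate one fixed model under every combination of interface settings."""
    results = {}
    for history, cot in product(history_lengths, cot_options):
        results[(history, cot)] = evaluate(history=history, chain_of_thought=cot)
    return results


def spread(results):
    """Return the (lowest, highest) success rate observed across the sweep."""
    rates = list(results.values())
    return min(rates), max(rates)


def toy_evaluate(history, chain_of_thought):
    # Placeholder scoring rule invented for this sketch. A real run would
    # play benchmark episodes and return the measured success rate.
    return 0.5 - 0.01 * history - (0.1 if chain_of_thought else 0.0)


if __name__ == "__main__":
    results = sweep_interface(toy_evaluate, history_lengths=[0, 10], cot_options=[False, True])
    low, high = spread(results)
    print(f"success rate ranges from {low:.0%} to {high:.0%} across {len(results)} configurations")
```

Reporting the full range across configurations, rather than a single number per model, makes it explicit how much of a score is attributable to the interface.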
Conclusion
Under this unified zero-shot ASCII protocol, the evaluated open-weight LLM agents remain significantly below approximate rodent reference values. The findings point in particular to difficulties with tasks that require spatial navigation and within-trial state tracking. CheeseBench offers a way to evaluate the cognitive capabilities of LLMs against established rodent paradigms and opens avenues for further research on understanding and improving these models.
