CheeseBench: Benchmarking LLMs on Rodent Neuroscience Tasks

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

In a groundbreaking study, researchers have introduced CheeseBench, a novel benchmark designed to evaluate large language models (LLMs) on various classical behavioral neuroscience paradigms. This study, documented in the paper arXiv:2604.10825v1, aims to shed light on the cognitive abilities of LLMs by simulating tasks traditionally used in rodent behavioral studies.

About CheeseBench

CheeseBench encompasses nine well-known behavioral neuroscience paradigms, including:

Morris water maze
Barnes maze
T-maze
Radial arm maze
Star maze
Operant chamber
Shuttle box
Conditioned place preference
Delayed non-match to sample

These paradigms cover six distinct cognitive dimensions, allowing for a comprehensive assessment of LLMs. Each task is rooted in peer-reviewed rodent protocols, with approximate animal baselines provided for comparison.

Methodology

The benchmark presents a unique challenge to the models involved. Each agent receives a unified system prompt devoid of task-specific instructions, compelling them to discover goals solely from ASCII text observations and reward signals. This setup mirrors the experience of a rodent placed in an unfamiliar environment, emphasizing the model’s ability to learn and adapt without prior guidance.

Model Evaluation

The study evaluates six open-weight LLMs with parameter sizes ranging from 3 billion to 72 billion. The performance of these models is assessed through text-based ASCII renderings, with results compared against both a random baseline and a graph-based reinforcement learning agent. Notably, the model achieving the highest success rate was Qwen2.5-VL-7B, which attained an average success rate of 52.6% on the ASCII input tasks.

Key Findings

The research yielded several critical insights:

Scaling beyond 7 billion parameters results in diminishing returns.
Longer context history appears to degrade performance.
Employing chain-of-thought prompting is counterproductive.
A vision-language architecture provides advantages at 7 billion parameters but proves detrimental at 32 billion parameters.

Interestingly, the performance of the same model varied significantly, ranging from 20% to 57%, depending solely on interface parameters. This highlights the importance of the agent-plus-interface system rather than the model in isolation.

Conclusion

Under this unified zero-shot ASCII protocol, the current open-weight LLM agents have demonstrated performance levels that remain significantly below approximate rodent reference values. The findings especially underscore challenges in tasks that require spatial navigation and within-trial state tracking. CheeseBench represents a significant step forward in evaluating the cognitive capabilities of LLMs and opens avenues for further research in understanding and enhancing these models.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!

Company

CheeseBench: Benchmarking LLMs on Rodent Neuroscience Tasks

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

About CheeseBench

Methodology

Model Evaluation

Key Findings

Conclusion

Related AI Insights

Subscribe

More like thisRelated

About us

Company

The latest

Subscribe

RichlyAI Blog
AI Guide, Tutorials, Industrial Insights, & more!

More like this
Related