CheeseBench: Benchmarking LLMs on Rodent Neuroscience Tasks

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

Researchers have introduced CheeseBench, a benchmark that evaluates large language models (LLMs) on classical behavioral neuroscience paradigms. The study, documented in the paper arXiv:2604.10825v1, probes the cognitive abilities of LLMs by simulating tasks traditionally used in rodent behavioral studies.

About CheeseBench

CheeseBench comprises nine well-known behavioral neuroscience paradigms:

  • Morris water maze
  • Barnes maze
  • T-maze
  • Radial arm maze
  • Star maze
  • Operant chamber
  • Shuttle box
  • Conditioned place preference
  • Delayed non-match to sample

These paradigms cover six distinct cognitive dimensions, allowing for a comprehensive assessment of LLMs. Each task is rooted in peer-reviewed rodent protocols, with approximate animal baselines provided for comparison.

Methodology

The benchmark gives the models no task-specific help. Each agent receives a unified system prompt devoid of task-specific instructions, so it must discover the goal solely from ASCII text observations and reward signals. This setup mirrors the experience of a rodent placed in an unfamiliar environment and tests the model's ability to learn and adapt without prior guidance.
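The observe, act, reward cycle described here can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's actual code: the `ToyTMaze` environment, its layout, the `SYSTEM_PROMPT` text, and the `run_episode` helper are all hypothetical names invented for this example. A random policy stands in for the LLM call.

```python
import random

# Hypothetical unified prompt: it names no task and no goal (not the paper's wording).
SYSTEM_PROMPT = "You are an agent in an environment. Choose one action per step."

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class ToyTMaze:
    """Hypothetical T-shaped ASCII maze; the reward sits unmarked in the right arm."""

    LAYOUT = [
        "#######",
        "#.....#",
        "###.###",
        "###.###",
        "#######",
    ]
    START = (3, 3)
    GOAL = (1, 5)

    def reset(self):
        self.pos = self.START
        return self.render()

    def render(self):
        # Only walls, floor, and the agent ("A") are drawn; the goal is never shown.
        rows = [list(row) for row in self.LAYOUT]
        r, c = self.pos
        rows[r][c] = "A"
        return "\n".join("".join(row) for row in rows)

    def step(self, action):
        dr, dc = ACTIONS[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if self.LAYOUT[r][c] != "#":  # moves into walls are ignored
            self.pos = (r, c)
        done = self.pos == self.GOAL
        return self.render(), (1.0 if done else 0.0), done


def random_policy(observation, last_reward, rng):
    """Stand-in for an LLM call: ignores its inputs and picks a random action."""
    return rng.choice(list(ACTIONS))


def run_episode(env, policy, max_steps=50, seed=0):
    """Run one trial; return (success, steps_taken)."""
    rng = random.Random(seed)
    obs, reward = env.reset(), 0.0
    for step in range(1, max_steps + 1):
        action = policy(obs, reward, rng)
        obs, reward, done = env.step(action)
        if done:
            return True, step
    return False, max_steps


if __name__ == "__main__":
    print(ToyTMaze().reset())
    print(run_episode(ToyTMaze(), random_policy))
```

Replacing `random_policy` with a function that sends `SYSTEM_PROMPT` and the current observation to a model would turn this into an LLM agent under the same interface. Because the goal cell is never rendered, the only way to find it is through exploration and the reward signal.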

Model Evaluation

The study evaluates six open-weight LLMs ranging from 3 billion to 72 billion parameters. The models receive text-based ASCII renderings of each task, and their results are compared against both a random baseline and a graph-based reinforcement learning agent. The best-performing model was Qwen2.5-VL-7B, with an average success rate of 52.6% on the ASCII input tasks.
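One plausible way to compute an "average success rate" is an unweighted mean of per-task success rates, so that each paradigm counts equally regardless of trial count. The sketch below assumes that definition; the article does not state how the paper aggregates its results. The task names and trial outcomes are placeholders, not data from the study.

```python
from statistics import mean


def success_rate(outcomes):
    """Fraction of successful trials in one task (outcomes are booleans)."""
    return sum(outcomes) / len(outcomes)


def average_success(per_task_outcomes):
    """Unweighted mean of per-task success rates (assumed aggregation)."""
    return mean(success_rate(o) for o in per_task_outcomes.values())


# Placeholder outcomes for illustration only; not results from the paper.
placeholder_agent = {
    "task_a": [True, False, True, True],
    "task_b": [False, False, True, False],
}
placeholder_random = {
    "task_a": [False, False, True, False],
    "task_b": [False, False, False, False],
}

if __name__ == "__main__":
    agent = average_success(placeholder_agent)
    baseline = average_success(placeholder_random)
    print(f"agent {agent:.3f}, random {baseline:.3f}, gap {agent - baseline:.3f}")
```

Reporting the gap to the random baseline alongside the raw rate helps separate genuine task learning from success that chance exploration would produce anyway.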

Key Findings

The research yielded several critical insights:

  • Scaling beyond 7 billion parameters results in diminishing returns.
  • Longer context history appears to degrade performance.
  • Employing chain-of-thought prompting is counterproductive.
  • A vision-language architecture provides advantages at 7 billion parameters but proves detrimental at 32 billion parameters.

The performance of a single model ranged from 20% to 57% depending solely on interface parameters. The unit being measured is therefore the agent-plus-interface system, not the model in isolation.
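To make "interface parameters" concrete, the sketch below treats two of the factors named in the findings, context history length and chain-of-thought prompting, as settings of a prompt builder, then enumerates every combination for a sweep. The `InterfaceConfig` fields, the prompt wording, and the helper names are hypothetical. The article does not list the parameters the paper actually varied.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class InterfaceConfig:
    history_turns: int      # how many past (obs, action, reward) turns to include
    chain_of_thought: bool  # whether to ask for reasoning before the action


def build_prompt(config, history, observation):
    """Assemble the user prompt for one step under a given interface config."""
    # history[-0:] would return the whole list, so zero is handled explicitly.
    recent = history[-config.history_turns:] if config.history_turns > 0 else []
    parts = [f"obs:\n{o}\naction: {a}\nreward: {r}" for o, a, r in recent]
    parts.append(f"obs:\n{observation}")
    if config.chain_of_thought:
        parts.append("Think step by step, then give one action.")
    else:
        parts.append("Reply with one action.")
    return "\n\n".join(parts)


def config_grid(history_options, cot_options):
    """Every combination of the interface settings to evaluate."""
    return [InterfaceConfig(h, c) for h, c in product(history_options, cot_options)]


if __name__ == "__main__":
    for cfg in config_grid([0, 2, 8], [False, True]):
        print(cfg)
```

Running the same model across such a grid and recording the success rate per configuration is one way to see how much of a score belongs to the interface and how much to the model.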

Conclusion

Under this unified zero-shot ASCII protocol, current open-weight LLM agents perform well below approximate rodent reference values. The gap is widest in tasks that require spatial navigation and within-trial state tracking. CheeseBench offers a structured way to evaluate the cognitive capabilities of LLMs and opens avenues for further research into understanding and improving these models.


Lazarus Omolua
https://richlyai.com/blog
My mission is to make sure that people in Africa are not left behind in the global AI revolution. RichlyAI exists to give everyone — students, founders, creators, and businesses — the tools to compete globally.
