ImplicitMemBench: Measuring Unconscious Behavioral Adaptation in Large Language Models
arXiv:2604.08064v2
Abstract: Existing memory benchmarks for LLM agents evaluate explicit recall of facts, yet overlook implicit memory where experience becomes automated behavior without conscious retrieval. This gap is critical: effective assistants must automatically apply learned procedures or avoid failed actions without explicit reminders. We introduce ImplicitMemBench, the first systematic benchmark evaluating implicit memory through three cognitively grounded constructs drawn from standard cognitive-science accounts of non-declarative memory: Procedural Memory (one-shot skill acquisition after interference), Priming (theme-driven bias via paired experimental/control instances), and Classical Conditioning (Conditioned Stimulus–Unconditioned Stimulus (CS–US) associations shaping first decisions).
Our 300-item suite employs a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. Evaluation of 17 models reveals severe limitations: no model exceeds 66% overall, with top performers DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%) far below human baselines. Analysis uncovers dramatic asymmetries (inhibition 17.6% vs. preference 75.0%) and universal bottlenecks requiring architectural innovations beyond parameter scaling. ImplicitMemBench reframes evaluation from “what agents recall” to “what they automatically enact”.
Introduction to ImplicitMemBench
Most memory benchmarks for large language model (LLM) agents test explicit recall of facts. ImplicitMemBench instead targets implicit memory, in which experience shapes behavior without conscious retrieval. It measures whether a model automatically applies a learned procedure, or avoids a previously failed action, without being reminded, a capability that effective AI assistants need.
Key Constructs of ImplicitMemBench
ImplicitMemBench is grounded in three primary constructs from cognitive science, which are essential for understanding non-declarative memory:
- Procedural Memory: One-shot skill acquisition that must survive interference. The model sees a procedure once and is tested on it after intervening, unrelated material.
- Priming: Theme-driven bias in responses, measured with paired experimental and control instances so that any shift can be attributed to the earlier exposure.
- Classical Conditioning: Associations between a conditioned stimulus and an unconditioned stimulus (CS–US), measured by how they shape the model's first decision.
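The abstract describes a unified Learning/Priming-Interfere-Test protocol with first-attempt scoring. The sketch below shows one way such a loop could be organized. It is a minimal illustration, not the benchmark's actual harness: the `Item` fields, the `Model` callable signature, the toy item, and both stand-in models are assumptions invented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]          # (user turn, model reply) pairs
Model = Callable[[History, str], str]    # hypothetical model interface


@dataclass
class Item:
    """One hypothetical item in the three-phase layout."""
    learning: str                    # Learning/Priming phase
    interference: List[str]          # unrelated turns before the test
    test: str                        # test turn, with no reminder of the learning phase
    passes: Callable[[str], bool]    # checker applied to the first test reply


def score_first_attempt(model: Model, item: Item) -> bool:
    history: History = []
    reply = ""
    for turn in [item.learning, *item.interference, item.test]:
        reply = model(history, turn)
        history.append((turn, reply))
    # Only the first reply to the test turn counts; no retries or hints.
    return item.passes(reply)


def accuracy(model: Model, items: List[Item]) -> float:
    return sum(score_first_attempt(model, it) for it in items) / len(items)


# Toy item, invented for illustration (not taken from the benchmark).
item = Item(
    learning="Note: in this workspace the `deploy` command fails; `ship` works.",
    interference=["What is 17 * 3?", "Name a primary color."],
    test="Please release the build.",
    passes=lambda reply: "ship" in reply and "deploy" not in reply,
)


def adapting_model(history: History, turn: str) -> str:
    # Stand-in that applies the earlier experience without being reminded.
    learned = any("`ship` works" in user for user, _ in history)
    if "release" in turn:
        return "Running ship." if learned else "Running deploy."
    return "ok"


def forgetful_model(history: History, turn: str) -> str:
    # Stand-in that ignores the conversation history entirely.
    return "Running deploy." if "release" in turn else "ok"


print(accuracy(adapting_model, [item]), accuracy(forgetful_model, [item]))
```

The key design point is that the test turn never restates the lesson, and the score is taken from the single first reply, so a model that only succeeds after a hint or a retry gets no credit.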
Findings and Implications
The evaluation covers 17 models on the 300-item suite, and none exceeds 66% overall. The top performers, DeepSeek-R1 (65.3%), Qwen3-32B (64.1%), and GPT-5 (63.0%), all fall far below human baselines. The authors attribute this to universal bottlenecks that call for architectural innovations beyond parameter scaling.
The analysis also reports a dramatic asymmetry: accuracy is 17.6% on inhibition but 75.0% on preference. This suggests that models take up a learned preference far more readily than they withhold a response they have learned to avoid. The gap matters for real-world applications in which an assistant must steer clear of a failed action without an explicit reminder.
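To put the reported figures in proportion, the short calculation below works only from numbers quoted in the abstract (inhibition 17.6%, preference 75.0%, and the three top overall scores). The variable names are illustrative, and nothing here reproduces the paper's own analysis.

```python
# Figures quoted in the abstract (percent accuracy).
inhibition = 17.6
preference = 75.0
top_scores = {"DeepSeek-R1": 65.3, "Qwen3-32B": 64.1, "GPT-5": 63.0}

# Absolute gap in percentage points and relative ratio between the two.
gap_points = round(preference - inhibition, 1)
ratio = round(preference / inhibition, 2)

# Spread between the best and third-best overall scores.
top_spread = round(max(top_scores.values()) - min(top_scores.values()), 1)

print(gap_points, ratio, top_spread)  # 57.4 4.26 2.3
```

The inhibition/preference gap (57.4 points) is roughly 25 times larger than the spread among the three leading models (2.3 points), which is consistent with the abstract's framing of the bottleneck as shared across models rather than specific to one of them.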
Conclusion
ImplicitMemBench reframes evaluation from what agents recall to what they automatically enact. Its results show how far current LLMs remain from human baselines on implicit memory, and they give future work on model and agent design a concrete target.
