MEMENTO: Teaching LLMs to Manage Their Own Context
Researchers recently introduced MEMENTO, a method for improving how large language models (LLMs) manage their own reasoning context. The paper, “MEMENTO: Teaching LLMs to Manage Their Own Context” (arXiv 2604.09852v1), targets a limitation of current reasoning models: they have no built-in way to compress and organize their intermediate reasoning.
Abstract Overview
Reasoning models typically produce long, unstructured streams of tokens, with no mechanism for summarizing or organizing intermediate states. MEMENTO teaches models to segment their reasoning into blocks and to compress each block into what the researchers term a “memento,” a dense summary of that block. The model then continues reasoning from the mementos instead of the full blocks. This reduces the amount of context needed, lowers key-value (KV) cache usage, and saves compute.
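The block-and-memento idea can be illustrated with a small text-level sketch. This is not the authors' implementation: the Block class, the visible_context function, and the whitespace token count are hypothetical names and simplifications introduced here, and the sketch works on strings only, ignoring KV states.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """One reasoning segment and, once finished, its dense summary (hypothetical type)."""
    reasoning: str
    memento: Optional[str] = None  # None while the block is still being written


def visible_context(prompt: str, blocks: List[Block]) -> str:
    """Build the text the model would condition on for its next step.

    Finished blocks contribute only their memento; a block still in
    progress contributes its full reasoning text.
    """
    parts = [prompt]
    for block in blocks:
        parts.append(block.memento if block.memento is not None else block.reasoning)
    return "\n".join(parts)


def rough_token_count(text: str) -> int:
    """Whitespace word count, a stand-in for a real tokenizer."""
    return len(text.split())


prompt = "Solve: what is 12 * 13?"
blocks = [
    Block(
        "First compute 12 * 10 = 120 and 12 * 3 = 36 , then check both products again .",
        memento="12*10=120, 12*3=36",
    ),
    Block("Now add 120 and 36 to get"),  # still in progress, no memento yet
]

full = "\n".join([prompt] + [b.reasoning for b in blocks])
compact = visible_context(prompt, blocks)
print(rough_token_count(full), rough_token_count(compact))
```

In this toy, the compact context is shorter because the finished block is represented only by its memento, while the unfinished block stays in full.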
OpenMementos Dataset
To support training, the researchers released OpenMementos, a public dataset of 228,000 reasoning traces derived from OpenThoughts-v3. Each trace has been segmented and annotated with intermediate summaries. The dataset's availability is expected to accelerate research on LLM reasoning.
Training Methodology
The researchers applied a two-stage supervised fine-tuning (SFT) recipe on the OpenMementos dataset. The recipe proved effective across model families, including Qwen3, Phi-4, and Olmo 3, at scales from 8 billion to 32 billion parameters. Models trained this way maintain high accuracy on mathematics, science, and coding benchmarks.
Performance Improvements
MEMENTO reduces peak KV cache usage by approximately 2.5 times. The researchers also extended the vLLM framework to support their inference method, which yielded an estimated throughput improvement of around 1.75 times. The same extension made it practical to apply reinforcement learning (RL), which further improved accuracy.
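To give a sense of scale for the 2.5 times figure, the sketch below applies the standard KV cache size formula (keys plus values, per layer, per KV head, per token). The model shape and the peak token count are hypothetical values chosen for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape and context length (assumptions, not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128
peak_tokens_baseline = 32_000
# A 2.5x reduction in peak KV usage corresponds to 2.5x fewer cached tokens.
peak_tokens_reduced = int(peak_tokens_baseline / 2.5)

baseline = kv_cache_bytes(peak_tokens_baseline, LAYERS, KV_HEADS, HEAD_DIM)
reduced = kv_cache_bytes(peak_tokens_reduced, LAYERS, KV_HEADS, HEAD_DIM)

gib = 1024 ** 3
print(f"baseline: {baseline / gib:.2f} GiB, reduced: {reduced / gib:.2f} GiB")
print(f"ratio: {baseline / reduced:.2f}x")
```

Because cache size is linear in the number of cached tokens, the memory saved per sequence can be spent on a larger batch, which is one plausible reason a smaller cache goes together with higher throughput.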
Dual Information Stream
The research also uncovered a dual information stream in the MEMENTO approach. Information from each reasoning block reaches later steps through two channels: the memento text itself, and the corresponding KV states, which retain implicit information from the original reasoning block. Removing the KV channel caused a significant drop in accuracy: 15 percentage points on the AIME24 benchmark.
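A toy NumPy example can show why KV states might carry information that the memento text alone does not. It is an illustration of causal attention in general, not the paper's ablation: the single-head layer, the random weights, and the reuse of one weight matrix for the next layer's keys are all simplifying assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size of the toy model
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))


def causal_attention_layer(x: np.ndarray) -> np.ndarray:
    """Single-head causal self-attention with a residual connection."""
    n = len(x)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # positions after the query
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v


block = rng.normal(size=(5, d))    # embeddings of the reasoning-block tokens
memento = rng.normal(size=(3, d))  # embeddings of the memento tokens

# Case A: memento tokens are encoded while the block is still in context.
hidden_with_block = causal_attention_layer(np.vstack([block, memento]))[-3:]
# Case B: memento tokens are re-encoded on their own, as plain text.
hidden_alone = causal_attention_layer(memento)

# Keys that a following layer would cache for the memento positions
# (the same Wk is reused here purely to keep the toy short).
keys_with_block = hidden_with_block @ Wk
keys_alone = hidden_alone @ Wk

print("max key difference:", np.abs(keys_with_block - keys_alone).max())
```

The memento tokens are identical in both cases, yet the keys cached above the first layer differ, because in Case A the memento positions attended to the block when their hidden states were computed.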
Conclusion
MEMENTO changes how LLMs can handle their own reasoning. By learning to segment and compress their context, models use less KV cache and achieve higher throughput while maintaining accuracy on reasoning benchmarks, with RL providing further gains. As reasoning models continue to evolve, techniques like MEMENTO are likely to play an important role.
