PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection
Published: 06 Apr 2026 | Last Modified: 06 Apr 2026 | Type: New | arXiv: 2604.05424v1
Abstract
The emergence of reasoning models, exemplified by OpenAI o1, signifies a transition from intuitive to deliberative cognition, effectively reorienting the scaling laws from pre-training paradigms toward test-time computation. While Monte Carlo Tree Search (MCTS) has shown promise in this domain, existing approaches typically treat each rollout as an isolated trajectory. This lack of information sharing leads to severe inefficiency and substantial computational redundancy, as the search process fails to leverage insights from prior explorations.
Introduction
To address these limitations, we propose PRISM-MCTS, a reasoning framework inspired by human parallel thinking and reflective processes. It integrates a Process Reward Model (PRM) with a dynamic shared memory that captures both “Heuristics” and “Fallacies” in reasoning tasks.
Key Features of PRISM-MCTS
- Process Reward Model (PRM): A core component that reinforces successful strategies and prunes error-prone branches.
- Dynamic Shared Memory: Retains insights from previous reasoning trajectories so that later rollouts can reuse them instead of rediscovering them.
- Metacognitive Reflection: Modeled on human cognition, this lets the model reflect on its own reasoning process and adjust subsequent decisions.
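The interplay of these three features can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SharedMemory`, `run_rollout`, and `toy_prm`, the score thresholds, and the rule for routing steps into memory are all assumptions made for illustration. The sketch only shows how a PRM score might sort reasoning steps into "Heuristics" and "Fallacies" so that a later rollout can skip a step already known to be bad.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SharedMemory:
    """Hypothetical store of insights shared across rollouts."""
    heuristics: List[str] = field(default_factory=list)  # steps worth reusing
    fallacies: List[str] = field(default_factory=list)   # steps to avoid

    def record(self, step: str, score: float,
               hi: float = 0.8, lo: float = 0.2) -> None:
        # Route a scored step into memory; both thresholds are illustrative.
        if score >= hi and step not in self.heuristics:
            self.heuristics.append(step)
        elif score <= lo and step not in self.fallacies:
            self.fallacies.append(step)


def run_rollout(steps: List[str], prm: Callable[[str], float],
                memory: SharedMemory, prune_below: float = 0.2) -> List[str]:
    """Score steps in order, update memory, and stop at the first pruned step."""
    kept: List[str] = []
    for step in steps:
        if step in memory.fallacies:
            break  # known fallacy: prune without calling the PRM again
        score = prm(step)
        memory.record(step, score)
        if score <= prune_below:
            break  # low-scoring step: prune the rest of this branch
        kept.append(step)
    return kept


# Toy stand-in for a PRM (assumption): fixed scores, with a call log.
prm_calls: List[str] = []


def toy_prm(step: str) -> float:
    prm_calls.append(step)
    return {"define variables": 0.9, "divide by zero": 0.1}.get(step, 0.5)


memory = SharedMemory()
trajectory = ["define variables", "divide by zero", "conclude"]
first = run_rollout(trajectory, toy_prm, memory)
calls_after_first = len(prm_calls)
second = run_rollout(trajectory, toy_prm, memory)
calls_after_second = len(prm_calls)
```

In this toy run the first rollout pays for two PRM calls and records one heuristic and one fallacy. The second rollout reaches the same pruning decision with a single PRM call, because the bad step is already in memory. A real system would match steps semantically rather than by exact string and would feed the memory into generation, which this sketch leaves out.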
Methodology
PRISM-MCTS trains the PRM with a data-efficient strategy, which is particularly useful when labeled data is scarce. Operating in a few-shot regime, the PRM achieves high-fidelity evaluation across various reasoning benchmarks. This strategy reduces the number of trajectories required and improves the framework’s scalability and efficiency.
Empirical Evaluations
Empirical evaluations across diverse reasoning benchmarks substantiate the efficacy of PRISM-MCTS. Notably, our model halves the trajectory requirements on the Graduate-Level Google-Proof Q&A (GPQA) benchmark while surpassing existing methods such as MCTS-RAG and Search-o1.
Conclusion
PRISM-MCTS uses metacognitive reflection and shared memory to improve reasoning in language models. The framework addresses the inefficiency of MCTS approaches that treat each rollout in isolation and, in the reported evaluations, outperforms MCTS-RAG and Search-o1 while requiring fewer trajectories.
Keywords
- Efficient/Low-Resource Methods for NLP
- Generation
- Question Answering
For further details and to access the full paper, please refer to arXiv:2604.05424v1.
