Selective Forgetting for Large Reasoning Models
Summary: arXiv:2604.03571v1
Large Reasoning Models (LRMs) have gained prominence due to their ability to generate structured chains of thought (CoTs) before arriving at final answers. While this capability enhances their reasoning processes, it also exposes them to significant vulnerabilities, particularly in terms of knowledge leakage through intermediate reasoning steps. The retention of sensitive information from training data, including copyrighted and private content, has raised serious ethical and legal concerns within the field.
In response, selective forgetting (also known as machine unlearning) has emerged as a potential solution for LRMs. However, existing unlearning methods primarily target the final answers these models generate. This can leave sensitive content in the intermediate reasoning steps, and it may degrade the overall reasoning ability of LRMs after unlearning. Conversely, applying unlearning directly to the entire CoT can impair the general reasoning capabilities of these models.
Challenges in LRM Unlearning
The primary challenge in LRM unlearning is to remove the targeted knowledge precisely while preserving general reasoning capabilities. Unlearning that is too narrow can leave sensitive content in the reasoning trace, while unlearning that is too broad degrades the model’s performance.
Proposed Framework for Selective Forgetting
To bridge the gap between unlearning and reasoning integrity, our research introduces a novel framework aimed at selectively removing sensitive reasoning components without compromising general reasoning capabilities. The key features of our approach include:
- Multiple LLMs with Retrieval-Augmented Generation (RAG): Our framework leverages multiple large language models equipped with RAG to analyze CoT traces effectively.
- Identification of Forget-Relevant Segments: The framework identifies segments within the reasoning chains that require unlearning, focusing on sensitive content.
- Replacement with Benign Placeholders: Sensitive components are replaced with benign placeholders that maintain the logical structure of the reasoning chain.
- Feature Replacement Unlearning Loss: We introduce a new loss function that suppresses the probability of generating forgotten content while reinforcing the generation of structurally valid replacements.
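The last two steps can be illustrated with a minimal sketch. The source does not give the exact form of the loss, so everything here is an assumption made for illustration: the `[REDACTED]` placeholder, the function names, the weights `alpha` and `beta`, and the choice of a per-token penalty `-log(1 - p(forget token))` for suppression plus a standard negative log-likelihood on the replacement token for reinforcement. It is not the authors' implementation.

```python
import math
from typing import List, Sequence

PLACEHOLDER = "[REDACTED]"  # hypothetical benign placeholder, not from the paper


def replace_forget_segments(steps: Sequence[str],
                            forget_mask: Sequence[bool],
                            placeholder: str = PLACEHOLDER) -> List[str]:
    """Swap flagged CoT steps for a placeholder, keeping step count and order."""
    if len(steps) != len(forget_mask):
        raise ValueError("steps and forget_mask must have the same length")
    return [placeholder if flagged else step
            for step, flagged in zip(steps, forget_mask)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def replacement_unlearning_loss(logits: Sequence[Sequence[float]],
                                forget_ids: Sequence[int],
                                replace_ids: Sequence[int],
                                alpha: float = 1.0,
                                beta: float = 1.0,
                                eps: float = 1e-12) -> float:
    """Assumed loss form: penalize probability mass on the forgotten token
    and reward probability mass on the replacement token at each position."""
    if not logits:
        raise ValueError("logits must contain at least one position")
    if not (len(logits) == len(forget_ids) == len(replace_ids)):
        raise ValueError("all inputs must cover the same positions")
    total = 0.0
    for row, f, r in zip(logits, forget_ids, replace_ids):
        probs = softmax(row)
        suppress = -math.log(max(1.0 - probs[f], eps))  # low when p(forget) is small
        reinforce = -math.log(max(probs[r], eps))        # low when p(replace) is large
        total += alpha * suppress + beta * reinforce
    return total / len(logits)


if __name__ == "__main__":
    steps = ["Recall the patient record.", "The patient is John Doe.", "Conclude."]
    print(replace_forget_segments(steps, [False, True, False]))
    # Toy 3-token vocabulary: token 0 is the forgotten token, token 1 the replacement.
    print(replacement_unlearning_loss([[5.0, 0.0, 0.0]], [0], [1]))  # model still prefers token 0
    print(replacement_unlearning_loss([[0.0, 5.0, 0.0]], [0], [1]))  # model prefers the replacement
```

In this toy setup the loss is large when the model still favors the forgotten token and close to zero when it favors the replacement, which is the behavior the bullet describes. In practice the flagged segments would come from the LLM-plus-RAG analysis step, and the loss would be computed over model logits in a training framework rather than plain Python lists.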
Empirical Validation
To validate the efficacy of our proposed method, we conducted extensive experiments on both synthetic and medical datasets. The results confirm the desired properties of our selective forgetting framework, demonstrating that it effectively removes sensitive information while preserving the model’s reasoning capabilities.
In conclusion, our research takes a step toward addressing the ethical and legal challenges associated with Large Reasoning Models. By forgetting selectively at the level of individual reasoning segments, sensitive content can be removed while the model retains its reasoning capabilities, which supports more responsible deployment of these systems.
