Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation
arXiv:2604.15482v1 (announce type: cross)
Abstract: Large Language Model (LLM) unlearning is crucial for removing hazardous or privacy-leaking information from a model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods primarily focus on a limited subset of these goals, typically unlearning efficacy and utility preservation, while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to interference between unlearning tasks.
We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: we standardize training corpora into a unified data representation to reduce the domain gap, and then introduce a bidirectional distillation method that simultaneously elicits desired behavior from a context-instructed teacher while suppressing undesirable behavior in the student model. Theoretical and empirical analyses show that our method aligns domain distributions and converts seemingly irrelevant unlearning tasks into cooperative optimization. Evaluation demonstrates state-of-the-art performance, enabling balanced and reliable unlearning across diverse, challenging requirements.
Introduction
As the deployment of Large Language Models (LLMs) continues to expand, the need for effective and reliable unlearning mechanisms becomes increasingly evident. The challenge lies in the fact that unlearning must address multiple objectives that may conflict with one another.
Challenges in LLM Unlearning
Current unlearning methods often exhibit the following limitations:
- Focus on Limited Objectives: Many techniques prioritize unlearning efficacy and utility preservation while neglecting robustness and boundary behavior.
- Task Interference: Simply extending existing methods to accommodate multiple objectives may result in interference, leading to suboptimal performance.
- Neglected Robustness: Robustness against adversarial probing attacks is critical, yet standard methods frequently overlook it.
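One common way to make task interference concrete is to compare the gradients of two objectives: a negative cosine similarity means that, to first order, a step that improves one objective worsens the other. The sketch below is illustrative only and is not taken from the paper. The quadratic losses, the parameter vector, and the names `forget_target`, `retain_target`, and `robust_target` are hypothetical stand-ins for real unlearning objectives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is zero)."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def quadratic_grad(params, target):
    """Gradient of the toy loss 0.5 * ||params - target||^2."""
    return [p - t for p, t in zip(params, target)]


# Hypothetical two-parameter "model" and toy optima for three objectives.
params = [0.0, 0.0]
forget_target = [1.0, 0.0]
retain_target = [-1.0, 0.5]
robust_target = [1.0, 1.0]

g_forget = quadratic_grad(params, forget_target)
g_retain = quadratic_grad(params, retain_target)
g_robust = quadratic_grad(params, robust_target)

# Negative value: the two objectives pull the parameters in opposing directions.
conflict = cosine_similarity(g_forget, g_retain)
# Positive value: the two objectives are (partly) cooperative.
cooperate = cosine_similarity(g_forget, g_robust)

print(f"forget vs retain: {conflict:.3f}")
print(f"forget vs robust: {cooperate:.3f}")
```

In this toy setting, the forget and retain gradients conflict while the forget and robustness gradients partly agree. With real models, the same measurement would be taken on per-objective gradients of the actual losses.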
The Proposed Framework
To address these challenges, we introduce a multi-objective unlearning framework characterized by:
- Unified Data Representation: Standardizing the training corpora into a single, unified representation reduces the domain gap between them.
- Bidirectional Logit Distillation: This technique elicits desired behavior from a context-instructed teacher model while simultaneously suppressing undesirable behavior in the student model.
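The text above does not give the exact loss, so the following is only a minimal sketch of one plausible reading of "bidirectional" distillation, under stated assumptions. It assumes the target logits are built by moving toward the context-instructed teacher and away from the uninstructed model, with a hypothetical strength parameter `alpha`, and that the student is trained with a KL divergence to that target. The function names, the three-token vocabulary, and the logit values are all invented for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_target(teacher_logits, base_logits, alpha=1.0):
    """Assumed target: pull toward the teacher, push away from the base model.

    alpha = 0 reduces to ordinary (one-directional) logit distillation.
    """
    return [t + alpha * (t - b) for t, b in zip(teacher_logits, base_logits)]


def distillation_loss(student_logits, teacher_logits, base_logits, alpha=1.0):
    """KL from the assumed bidirectional target to the student distribution."""
    target = softmax(bidirectional_target(teacher_logits, base_logits, alpha))
    return kl_divergence(target, softmax(student_logits))


# Toy vocabulary: [hazardous token, safe alternative, unrelated token].
base_logits = [3.0, 0.0, 1.0]      # uninstructed model prefers the hazardous token
teacher_logits = [0.0, 3.0, 1.0]   # same model when given an unlearning instruction

loss_at_base = distillation_loss(base_logits, teacher_logits, base_logits)
loss_at_teacher = distillation_loss(teacher_logits, teacher_logits, base_logits)

print(f"loss if student == base model: {loss_at_base:.3f}")
print(f"loss if student == teacher:    {loss_at_teacher:.3f}")
```

Under these assumptions, a student that still behaves like the uninstructed model incurs a much larger loss than one that matches the teacher. Matching the teacher alone does not drive the loss to zero, because the target also pushes further away from the undesired behavior.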
Results and Evaluation
Theoretical and empirical analyses show that our approach aligns domain distributions and converts seemingly unrelated unlearning tasks into cooperative optimization, so that the objectives reinforce rather than interfere with one another.
Evaluation results indicate that our framework achieves state-of-the-art performance across a variety of challenging requirements, demonstrating:
- Sustained unlearning efficacy
- Preservation of general utility
- Enhanced robustness against adversarial attacks
Conclusion
In conclusion, our multi-objective unlearning framework harmonizes multiple unlearning objectives through a unified data representation and bidirectional logit distillation, enabling balanced and reliable unlearning across diverse, challenging requirements.
