WARP: Guaranteed Inner-Layer Repair of NLP Transformers
Summary: arXiv:2604.00938v1
Abstract
Transformer-based natural language processing (NLP) models remain vulnerable to adversarial perturbations. Existing repair methods face a fundamental trade-off. Gradient-based approaches are flexible, but they lack verifiability and frequently overfit to the repair data. Methods that do offer repair guarantees are typically limited to the final layer or to small networks, which sharply restricts the parameter space available for repair. This article introduces WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models.
Key Features of WARP
WARP formulates the repair process as a convex quadratic program, which is derived from a first-order linearization of the logit gap. This approach enables tractable optimization across a high-dimensional parameter space. Under the condition that the first-order approximation holds, WARP provides three per-sample guarantees:
- Positive Margin Constraint: Ensures correct classification on repaired inputs.
- Preservation Constraints: Preserve the model's behavior on a designated remain set.
- Certified Robustness Radius: Derived from Lipschitz continuity, allowing for quantifiable robustness against adversarial inputs.
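The three guarantees can be illustrated on a toy linearized problem. The sketch below is not WARP's implementation. It assumes each sample contributes one linear constraint (current gap plus gradient times weight change must reach a margin), uses a minimum-norm objective, and solves the resulting convex quadratic program with Hildreth's dual coordinate ascent as a stand-in solver. The names `min_norm_update` and `certified_radius`, the toy gradients and gaps, and the Lipschitz constant are all hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def min_norm_update(grads, gaps, margin=0.1, sweeps=200):
    """Minimise 0.5 * ||delta||^2 subject to
    gaps[i] + grads[i] . delta >= margin for every sample i,
    using Hildreth's dual coordinate ascent (a stand-in QP solver)."""
    delta = [0.0] * len(grads[0])
    lam = [0.0] * len(grads)  # one multiplier per constraint, kept >= 0
    for _ in range(sweeps):
        for i, a in enumerate(grads):
            sq = dot(a, a)
            if sq == 0.0:
                continue
            # How far the linearized gap falls short of the margin.
            shortfall = margin - (gaps[i] + dot(a, delta))
            # Move the multiplier, but never below zero.
            step = max(-lam[i], shortfall / sq)
            lam[i] += step
            delta = [d + step * v for d, v in zip(delta, a)]
    return delta


def certified_radius(gap, lipschitz):
    """If the gap is `lipschitz`-Lipschitz in the input, it stays positive
    for every input perturbation of norm below gap / lipschitz."""
    return max(gap, 0.0) / lipschitz


# Hypothetical toy data: sample 0 is misclassified (negative gap) and must be
# repaired; sample 1 belongs to the remain set and must stay correct.
grads = [[1.0, 0.0], [-1.0, 1.0]]
gaps = [-0.5, 0.3]

delta = min_norm_update(grads, gaps, margin=0.1)
new_gaps = [g + dot(a, delta) for a, g in zip(grads, gaps)]
radii = [certified_radius(g, lipschitz=2.0) for g in new_gaps]
```

In this toy case, repairing sample 0 alone would push the remain sample's gap negative, so the preservation constraint becomes active and changes the update. The radius follows from the standard bound that an L-Lipschitz gap can drop by at most L times the perturbation norm. All of this holds only for the linearized gaps, which mirrors the condition that the first-order approximation holds.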
Enhanced Feasibility Across Model Architectures
Transformer architectures vary, and a repair problem that is feasible for one layer configuration may not be feasible for another. WARP therefore includes a sensitivity-based preprocessing step that conditions the optimization landscape so that the repair problem remains feasible across model configurations.
Convergence and Empirical Validation
WARP employs an iterative optimization procedure that, under mild assumptions, converges to solutions satisfying all specified repair constraints. An empirical evaluation on encoder-only Transformers with varying layer architectures shows that the guarantees hold in practice and that the repaired models are more robust to adversarial inputs.
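A loop of this kind can be sketched on a single-sample toy problem: re-linearize the gap at the current weights, take the minimum-norm step that satisfies the linearized margin constraint, and repeat until the true (nonlinear) gap reaches the margin. This is only an illustration under stated assumptions, not the procedure from the paper. The tanh model, the step cap `max_step` (added here to keep each step within a region where the first-order approximation is plausible), and the tolerance `tol` are all assumptions of the sketch.

```python
import math


def gap(w, x, bias):
    """Toy nonlinear logit gap; positive means the sample is classified correctly."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x))) + bias


def gap_grad(w, x):
    """Gradient of the toy gap with respect to the weights."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    s = 1.0 - math.tanh(z) ** 2
    return [s * xi for xi in x]


def iterative_repair(w, x, bias, margin=0.3, max_step=0.2, tol=1e-6, max_iters=50):
    """Repeat: linearize the gap, take the min-norm step meeting the linearized
    margin constraint (capped at max_step), then re-check the true gap."""
    w = list(w)
    for it in range(max_iters):
        g = gap(w, x, bias)
        if g >= margin - tol:
            return w, it  # true gap meets the margin up to tol
        a = gap_grad(w, x)
        sq = sum(v * v for v in a)
        if sq == 0.0:
            break  # no first-order information to act on
        # Closed-form solution of the one-constraint QP: delta = scale * a.
        scale = (margin - g) / sq
        norm = abs(scale) * math.sqrt(sq)
        if norm > max_step:
            scale *= max_step / norm
        w = [wi + scale * ai for wi, ai in zip(w, a)]
    return w, max_iters


# Hypothetical sample that starts misclassified (gap = -0.2 at w = 0).
x, bias = [1.0, 2.0], -0.2
w_fixed, iters = iterative_repair([0.0, 0.0], x, bias)
final_gap = gap(w_fixed, x, bias)
```

Because the toy gap is concave along the update direction, each linearized step undershoots slightly, which is why the loop re-checks the true gap instead of trusting the linearization after one solve.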
Conclusion
WARP advances the repair of NLP Transformers by replacing unconstrained gradient updates with principled constraint-based optimization. The result is a repair that carries per-sample guarantees, reaches beyond the final layer, and improves robustness to adversarial inputs. This work opens the way for further research on keeping NLP models robust and reliable in real-world applications.
