CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion
arXiv:2604.09101v1 (cross-listed)
Abstract
In the growing landscape of Machine Learning as a Service (MLaaS), organizations with limited data and computational resources often rely on external providers to train models. These providers adapt advanced vision-language models (VLMs) like CLIP to specific tasks through prompt tuning. However, this setup introduces significant security vulnerabilities. A malicious provider can exploit the prompt-tuning process to implant backdoors, making it possible for certain inputs to be classified into an attacker-specified category, even when those inputs are out-of-distribution (OOD).
Existing defenses that look for corrupted encoders fail to detect these backdoors, because the underlying encoders remain intact. Meanwhile, data-level techniques that sanitize data before training or during inference do not answer the pivotal question: “Is the delivered model backdoored or not?” To tackle this model-level verification challenge, we introduce CLIP-Inspector (CI), a backdoor detection method tailored for prompt-tuned CLIP models.
Functionality of CLIP-Inspector
CLIP-Inspector assumes white-box access to the delivered model and uses a pool of unlabeled OOD images. It works in two steps:
- Reconstructing potential triggers for each class.
- Determining if the model exhibits backdoor behavior based on the reconstructed triggers.
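The two steps above can be illustrated with a deliberately small toy. This is a sketch under stated assumptions, not the CI algorithm: a linear classifier stands in for the prompt-tuned CLIP model, the "images" are 4-dimensional vectors, the trigger is an additive perturbation bounded to [0, eps], and the decision rule (flag the model if any class's reconstructed trigger flips at least 90% of the pool) is a hypothetical choice, since the summary does not state how CI makes its decision. All names (`invert_trigger`, `attack_success_rate`, `inspect`, `POOL`) are illustrative.

```python
import math

def logits(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    z = logits(W, x)
    return z.index(max(z))

def stamp(x, delta):
    # Apply an additive trigger to one input.
    return [a + b for a, b in zip(x, delta)]

def invert_trigger(W, pool, target, eps=0.3, step=0.05, iters=20):
    """Signed gradient ascent on the target class log-probability,
    averaged over the unlabeled pool, with delta projected to [0, eps]."""
    d = len(pool[0])
    delta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for x in pool:
            p = softmax(logits(W, stamp(x, delta)))
            for j in range(d):
                # d/dx log softmax_target(Wx) = W[target] - sum_k p_k W[k]
                grad[j] += W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
        delta = [min(eps, max(0.0, dj + step * ((g > 0) - (g < 0))))
                 for dj, g in zip(delta, grad)]
    return delta

def attack_success_rate(W, pool, delta, target):
    # Fraction of inputs not already predicted as `target` that flip to it.
    others = [x for x in pool if predict(W, x) != target]
    hits = sum(predict(W, stamp(x, delta)) == target for x in others)
    return hits / len(others)

def inspect(W, pool, threshold=0.9):
    # Hypothetical decision rule: flag if any class has a near-universal trigger.
    scores = {}
    for c in range(len(W)):
        delta = invert_trigger(W, pool, c)
        scores[c] = attack_success_rate(W, pool, delta, c)
    return max(scores.values()) >= threshold, scores

# Toy models: feature 3 is unused by the clean model; the backdoored
# model maps it strongly to class 2.
CLEAN = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
BACKDOORED = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 10.0]]

# Stand-in for the unlabeled OOD pool (labels are never used here).
POOL = [
    [1.0, 0.1, 0.2, 0.0], [0.9, 0.2, 0.0, 0.0],
    [0.1, 1.0, 0.2, 0.0], [0.0, 0.8, 0.1, 0.0],
    [0.2, 0.1, 1.0, 0.0], [0.1, 0.0, 0.9, 0.0],
]

if __name__ == "__main__":
    print("clean:     ", inspect(CLEAN, POOL))
    print("backdoored:", inspect(BACKDOORED, POOL))
```

In this toy, a small bounded perturbation cannot flip the clean model's predictions for any class, while for the backdoored model the inversion for class 2 rediscovers the hidden feature and flips every other input. That asymmetry between classes is the kind of signal a trigger-inversion detector relies on.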
Furthermore, we show that fine-tuning the model on inputs stamped with CI’s reconstructed trigger and paired with their correct labels can realign the model and reduce the effectiveness of any backdoor present.
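The repair idea can be sketched on a toy model as well. The assumptions are again loud: a linear classifier stands in for the delivered model, `TRIGGER` is taken as already reconstructed, a small correctly labeled set (`LABELED`) is assumed to be available, and all weights are updated by plain gradient descent on cross-entropy. In a prompt-tuned CLIP model one would presumably update only the learned prompt; the summary does not give those details, so every name below is hypothetical.

```python
import math

def logits(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    z = logits(W, x)
    return z.index(max(z))

def stamp(x, delta):
    # Apply the additive trigger to one input.
    return [a + b for a, b in zip(x, delta)]

def fine_tune(W, inputs, labels, lr=1.0, steps=5000):
    """Full-batch gradient descent on mean cross-entropy; returns a new W."""
    W = [row[:] for row in W]
    n = len(inputs)
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for x, y in zip(inputs, labels):
            p = softmax(logits(W, x))
            for k in range(len(W)):
                err = p[k] - (1.0 if k == y else 0.0)
                for j, v in enumerate(x):
                    grad[k][j] += err * v / n
        for k in range(len(W)):
            for j in range(len(W[0])):
                W[k][j] -= lr * grad[k][j]
    return W

def attack_success_rate(W, data, delta, target):
    # Fraction of non-target inputs classified as `target` once stamped.
    victims = [x for x, y in data if y != target]
    hits = sum(predict(W, stamp(x, delta)) == target for x in victims)
    return hits / len(victims)

# Toy backdoored model: feature 3 is a hidden channel that forces class 2.
BACKDOORED = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 10.0]]
TARGET = 2
TRIGGER = [0.0, 0.0, 0.3, 0.3]  # assumed output of a trigger-inversion step

# Hypothetical small labeled set: (input, correct label).
LABELED = [
    ([1.0, 0.1, 0.2, 0.0], 0), ([0.9, 0.2, 0.0, 0.0], 0),
    ([0.1, 1.0, 0.2, 0.0], 1), ([0.0, 0.8, 0.1, 0.0], 1),
    ([0.2, 0.1, 1.0, 0.0], 2), ([0.1, 0.0, 0.9, 0.0], 2),
]

# Stamp every input with the trigger but keep its correct label.
stamped = [stamp(x, TRIGGER) for x, _ in LABELED]
labels = [y for _, y in LABELED]

repaired = fine_tune(BACKDOORED, stamped, labels)
asr_before = attack_success_rate(BACKDOORED, LABELED, TRIGGER, TARGET)
asr_after = attack_success_rate(repaired, LABELED, TRIGGER, TARGET)

if __name__ == "__main__":
    print("attack success rate before repair:", asr_before)
    print("attack success rate after repair: ", asr_after)
```

Training on triggered inputs with their correct labels teaches the model that the trigger carries no class information, which is why the trigger stops redirecting predictions in this toy. The sketch does not check accuracy on clean inputs; a real repair procedure would need to monitor that too.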
Experimental Validation
We conducted experiments across ten datasets and four backdoor attack methods. The results indicate that CI can reconstruct effective triggers within a single epoch using only 1,000 OOD images. CI reaches a detection accuracy of 94% (47 out of 50 models).
CI also outperforms adapted trigger-inversion baselines by a wide margin. It achieves an Area Under the Receiver Operating Characteristic curve (AUROC) of 0.973, compared with 0.495 and 0.687 for the baseline methods. These results support the use of CI for vetting and post-hoc repair of prompt-tuned CLIP models before they are deployed in real-world applications.
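For readers less familiar with the metric: AUROC equals the probability that a randomly chosen backdoored model receives a higher detection score than a randomly chosen clean model, with ties counted as one half. A value of 0.5 therefore corresponds to chance-level ranking, which puts the 0.495 baseline at roughly random. A minimal pairwise computation is sketched below; the function name and the example scores are illustrative and are not taken from the paper.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison (Mann-Whitney U statistic).

    scores: detection scores, higher means "more likely backdoored"
    labels: 1 for backdoored models, 0 for clean models
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    # Made-up scores for four models: two backdoored, two clean.
    print(auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect separation
    print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # one pair misordered
    print(auroc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # uninformative scores
```

The pairwise form is quadratic in the number of models, which is fine at the scale of 50 models reported here; a rank-based implementation would be the usual choice for larger evaluations.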
Conclusion
As dependence on MLaaS grows, so does the need for secure and reliable model deployment. CLIP-Inspector gives organizations a way to verify the integrity of delivered prompt-tuned CLIP models and, where a backdoor is found, to reduce its effect before deployment.
